Retrieval-Augmented Generation (RAG)

Overview

Retrieval-Augmented Generation (RAG) is a technique used to enhance the performance of Large Language Models (LLMs) by augmenting the prompt with external, (domain-specific) information retrieved before sending the prompt for inference. This allows models to reason over information they were not originally trained on, making it ideal for enterprise scenarios with private or constantly evolving data.

Figure 1. Example of RAG implementation

Why RAG?

LLMs are trained on large corpus of data but are inherently static and limited by their training data and cutoff date. RAG bridges this gap by injecting relevant content from trusted sources into the prompt, increasing reliability and accuracy without retraining the model.

Benefits of RAG:

Reduces hallucinations
Keeps answers up to date
Improves answer precision using curated data
Maintains a smaller context window by injecting only relevant information

Core Principles

At a high level, a RAG system follows this two-phase process:

Ingestion: Documents are ingested, segmented (chunked), embedded into vectors, and stored in a vector store.
Retrieval: At query time, the question is embedded, relevant segments are retrieved via similarity search, and then sent to the LLM along with the query.

End-to-End Flow

flowchart TB
  A[Documents] --> B[Text Splitter]
  B --> C[Text Segments]
  C --> D[Embedding Computation]
  D --> E[Embeddings]
  E --> F[Vector Store]

  G[Query] --> H[Query Embedding]
  H --> I[Find Relevant Segments]
  I --> J[Relevant Chunks]
  J --> K[Append to Prompt]
  K --> L[LLM Response]

Key Concepts

These concepts are foundational to any RAG-based application:

Concept	Description
Text Splitter	Divides documents into smaller overlapping or non-overlapping segments for processing.
Embedding	A dense vector representation of a text segment computed by an embedding model.
Embedding Model	A model trained to convert text into embeddings preserving semantic meaning.
Vector Store	A database that allows similarity search over embeddings (e.g., PGVector, Chroma, Weaviate).
Query Embedding	The embedding computed from the user’s question.
Retriever	A component responsible for finding relevant segments given a query embedding.
Augmentor	Optional logic used to enhance the retrieved content (e.g., filtering, re-ranking, transformation).
Prompt Injection	Final step of merging query and retrieved content into a prompt for the model.

Concept

Description

Text Splitter

Divides documents into smaller overlapping or non-overlapping segments for processing.

Embedding

A dense vector representation of a text segment computed by an embedding model.

Embedding Model

A model trained to convert text into embeddings preserving semantic meaning.

Vector Store

A database that allows similarity search over embeddings (e.g., PGVector, Chroma, Weaviate).

Query Embedding

The embedding computed from the user’s question.

Retriever

A component responsible for finding relevant segments given a query embedding.

Augmentor

Optional logic used to enhance the retrieved content (e.g., filtering, re-ranking, transformation).

Prompt Injection

Final step of merging query and retrieved content into a prompt for the model.

Visual Overview

Figure 2. RAG ingestion pipeline

Figure 3. RAG query-time augmentation pipeline

What’s Next?

This page introduces the fundamental concepts of RAG. To go deeper, visit the following sections:

Ingestion Pipeline
Query-Time Augmentation
EasyRAG
Contextual RAG (Advanced Techniques)
You can see supported document and vector stores in the menu