Contextual Retrieval-Augmented Generation (RAG)

Overview

Contextual RAG enhances the standard Retrieval-Augmented Generation (RAG) pipeline by introducing dynamic and modular components such as:

  • Query compression

  • Multi-retriever support

  • Query routing

  • Content scoring and re-ranking

  • Custom injection logic

This approach enhances precision, scalability, and alignment with user intent in AI-assisted workflows. Quarkus LangChain4j fully supports contextual RAG through modular component APIs and CDI-based configuration, making it easy to tailor retrieval behavior to your application’s needs.

Contextual RAG

The diagram above illustrates the high-level flow of contextual RAG: user queries can be transformed, routed to one or more retrievers, filtered, and then injected into the LLM prompt to produce a response.

Component Overview

Component         | Description
------------------|---------------------------------------------------------------------
Query Transformer | Rewrites or compresses the input query before retrieval
Query Router      | Dynamically selects which retriever(s) to invoke
Retriever         | Fetches contextually relevant content (e.g., vector DB, web search)
Aggregator        | Scores, ranks, filters, or deduplicates the retrieved content
Injector          | Injects the retrieved content into the user's prompt

Implementing Contextual RAG in Quarkus

Each contextual RAG component can be implemented individually and combined using DefaultRetrievalAugmentor.builder().

1. Query Transformer and Query Compression

A QueryTransformer modifies or rewrites the input query before retrieval. This is particularly useful when user queries are verbose, contain filler words, or require normalization to match indexed document formats.

One common use case is query compression, where a ChatModel is used to rephrase the query concisely, preserving its intent while improving retrieval accuracy.

You can access the default query transformer as follows:

public QueryTransformer getDefaultQueryTransformer(ChatModel chatModel) {
    return new CompressingQueryTransformer(chatModel);
}

You can also create a custom implementation if you want more control:

public QueryTransformer createQueryTransformer(ChatModel chatModel) {
    return question -> {
        String compressed = chatModel.chat("Summarize this: " + question.text());
        return List.of(Query.from(compressed));
    };
}

Using a transformer ensures your retrievers work with more focused queries, improving both latency and relevance.
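
To make the effect of compression concrete, here is a deliberately naive, model-free stand-in that merely strips filler words; a real CompressingQueryTransformer uses the ChatModel to rewrite the query while preserving its intent, so treat this purely as an illustration:

```java
import java.util.Arrays;
import java.util.Set;
import java.util.stream.Collectors;

// Naive stand-in for query compression: drop common filler words.
// A ChatModel-based transformer produces far better rewrites; this
// sketch only illustrates why shorter, focused queries retrieve better.
public class NaiveQueryCompressor {

    private static final Set<String> FILLERS = Set.of(
            "please", "could", "you", "tell", "me", "the", "a", "an", "about");

    public static String compress(String query) {
        return Arrays.stream(query.toLowerCase().split("\\s+"))
                .filter(word -> !FILLERS.contains(word))
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        // "could you tell me about the refund policy" -> "refund policy"
        System.out.println(compress("could you tell me about the refund policy"));
    }
}
```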

2. Multiple Retrievers

In a typical RAG pipeline, a vector store retrieves relevant content based on the query embedding, but it is not the only possible source of information. You can combine multiple sources, such as:

  • vector store (e.g., PGVector, Chroma, Weaviate)

  • web search (e.g., Tavily, Bing, Google)

  • knowledge bases (e.g., SQL, NoSQL databases)

  • full-text search engines (e.g., Elasticsearch, OpenSearch)

A content retriever is a component that fetches relevant content based on a query:

public interface ContentRetriever {

    /**
     * Retrieves relevant {@link Content}s using a given {@link Query}.
     * The {@link Content}s are sorted by relevance, with the most relevant {@link Content}s appearing
     * at the beginning of the returned {@code List<Content>}.
     *
     * @param query The {@link Query} to use for retrieval.
     * @return A list of retrieved {@link Content}s.
     */
    List<Content> retrieve(Query query);
}

You can therefore provide multiple implementations of ContentRetriever to cover different data sources.

For some of them, Quarkus LangChain4j provides built-in implementations, such as:

public List<ContentRetriever> createContentRetrievers(EmbeddingModel model,
        EmbeddingStore<TextSegment> store,
        WebSearchEngine webSearchEngine) {
    ContentRetriever vectorRetriever = EmbeddingStoreContentRetriever.builder()
            .embeddingModel(model)
            .embeddingStore(store)
            .maxResults(3)
            .build();

    ContentRetriever webRetriever = WebSearchContentRetriever.builder()
            .webSearchEngine(webSearchEngine)
            .build();

    return List.of(vectorRetriever, webRetriever);
}

A ContentRetriever implementation can itself use an LLM to transform the query before performing the actual retrieval, for example turning it into an SQL query or a web search query.
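
The contract is easy to satisfy with any backend. As a library-free illustration, the toy retriever below scores an in-memory corpus by keyword overlap with the query and returns matches most-relevant first, mirroring the ordering guarantee of ContentRetriever; the corpus and scoring are placeholders for a real store:

```java
import java.util.*;
import java.util.stream.Collectors;

// Toy in-memory retriever: scores documents by how many distinct query
// words they contain and returns them most-relevant first, like the
// ContentRetriever contract (most relevant content at the head of the list).
public class KeywordRetriever {

    private final List<String> corpus;

    public KeywordRetriever(List<String> corpus) {
        this.corpus = corpus;
    }

    public List<String> retrieve(String query) {
        Set<String> queryWords =
                new HashSet<>(Arrays.asList(query.toLowerCase().split("\\s+")));
        return corpus.stream()
                .sorted(Comparator.comparingLong(
                        (String doc) -> overlap(doc, queryWords)).reversed())
                .filter(doc -> overlap(doc, queryWords) > 0)
                .collect(Collectors.toList());
    }

    private static long overlap(String doc, Set<String> queryWords) {
        return Arrays.stream(doc.toLowerCase().split("\\s+"))
                .filter(queryWords::contains)
                .distinct()
                .count();
    }

    public static void main(String[] args) {
        KeywordRetriever r = new KeywordRetriever(List.of(
                "refund policy and terms",
                "shipping times",
                "terms and conditions of the refund"));
        // -> [refund policy and terms, terms and conditions of the refund]
        System.out.println(r.retrieve("refund terms"));
    }
}
```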

3. Query Router

The query router dynamically selects which retriever(s) to invoke based on the query content or context.

The default router will use all available retrievers, but you can create a custom one to route queries based on their content or type.

Smarter routers can use a ChatModel to determine which retriever(s) to use based on the query content or context:

public QueryRouter createQueryRouter(ChatModel chatModel,
        ContentRetriever vectorSearchRetriever,
        ContentRetriever webSearchRetriever) {
    // Create a query router that uses a chat model to route queries
    return LanguageModelQueryRouter.builder()
            .chatModel(chatModel)
            .retrieverToDescription(Map.of(
                    vectorSearchRetriever, "Retrieve relevant documents about terms and conditions",
                    webSearchRetriever, "Retrieve relevant documents from the quarkus.io website"))
            .build();
}
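
As a library-free sketch of such routing logic, the toy router below picks retrievers (represented here by name) by inspecting keywords in the query; a LanguageModelQueryRouter delegates this decision to the ChatModel instead:

```java
import java.util.List;

// Toy content-based router: pick retrievers by inspecting the query text.
// Retrievers are represented by illustrative names; a real QueryRouter
// returns ContentRetriever instances.
public class KeywordRouter {

    public static List<String> route(String query) {
        String q = query.toLowerCase();
        if (q.contains("terms") || q.contains("conditions")) {
            return List.of("vectorSearchRetriever");
        }
        if (q.contains("quarkus")) {
            return List.of("webSearchRetriever");
        }
        // Fallback: fan out to every retriever.
        return List.of("vectorSearchRetriever", "webSearchRetriever");
    }

    public static void main(String[] args) {
        // -> [vectorSearchRetriever]
        System.out.println(route("What are the terms of service?"));
    }
}
```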

You can also add a fallback if no retriever is selected:

public QueryRouter createQueryRouterWithFallback(ChatModel chatModel,
        ContentRetriever vectorSearchRetriever,
        ContentRetriever webSearchRetriever) {
    // Create a query router that uses a chat model to route queries
    return LanguageModelQueryRouter.builder()
            .chatModel(chatModel)
            .fallbackStrategy(LanguageModelQueryRouter.FallbackStrategy.ROUTE_TO_ALL)
            .retrieverToDescription(Map.of(
                    vectorSearchRetriever, "Retrieve relevant documents about terms and conditions",
                    webSearchRetriever, "Retrieve relevant documents from the quarkus.io website"))
            .build();
}

4. Content Aggregation and Reranking

When multiple retrievers are used, you may want to score and filter the results to ensure only the most relevant content is included in the final prompt. You can implement a custom ContentAggregator to score, (re-)rank, and filter the retrieved content.

public ContentAggregator createAggregator(ScoringModel model) {
    return ReRankingContentAggregator.builder()
            .scoringModel(model)
            .minScore(0.6)
            .build();
}

Reranking requires a scoring model. This can be a dedicated ScoringModel, a ChatModel or EmbeddingModel that computes similarity scores between the query and the content, or a custom scoring function.
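
The scoring-and-filtering step itself is simple to reason about. The sketch below mimics what a reranking aggregator does, using cosine similarity over hypothetical embedding vectors and a minScore cut-off; it is a library-free illustration, not the ReRankingContentAggregator implementation:

```java
import java.util.*;
import java.util.stream.Collectors;

// Sketch of rerank-and-filter: score each candidate against the query
// embedding with cosine similarity, drop low scores, sort best-first.
public class Reranker {

    public static double cosine(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    public static List<String> rerank(double[] queryEmbedding,
            Map<String, double[]> candidates, double minScore) {
        return candidates.entrySet().stream()
                .filter(e -> cosine(queryEmbedding, e.getValue()) >= minScore)
                .sorted(Comparator.comparingDouble(
                        (Map.Entry<String, double[]> e) ->
                                cosine(queryEmbedding, e.getValue())).reversed())
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        double[] query = {1, 0};
        Map<String, double[]> docs = new LinkedHashMap<>();
        docs.put("relevant", new double[]{0.9, 0.1});
        docs.put("off-topic", new double[]{0, 1});
        System.out.println(rerank(query, docs, 0.6)); // [relevant]
    }
}
```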

You can also combine all results without reranking (the order may not be deterministic):

ContentAggregator aggregator = new DefaultContentAggregator();

Reranking is covered in more detail in Reranking using scoring models.

5. Custom Content Injection

Control how the retrieved content is added to the final prompt:

public ContentInjector createContentInjector() {
    return (contents, userMessage) -> {
        // Use singleText() to get the raw text; toString() would include the message envelope
        StringBuilder prompt = new StringBuilder(userMessage.singleText() + "\n\nRelevant facts:\n");
        for (Content content : contents) {
            prompt.append("- ").append(content.textSegment().text()).append("\n");
        }
        return UserMessage.userMessage(prompt.toString());
    };
}

You can use a ContentInjector to customize how the retrieved content is injected into the prompt.

Implementing your own injector can be useful to later identify which parts of the prompt were augmented with RAG content, and to clean up the chat memory to avoid out-of-date information.
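
Because injection is plain string assembly, its shape is easy to unit-test, as recommended in the best practices below. Here is a library-free sketch of the assembly logic with an assertion-friendly layout; the class and method names are illustrative:

```java
import java.util.List;

// Library-free sketch of the injection step: append retrieved snippets
// under a "Relevant facts:" header so the augmented part of the prompt
// is easy to recognise (and strip from memory) later.
public class PromptInjector {

    public static String inject(String userMessage, List<String> snippets) {
        StringBuilder prompt = new StringBuilder(userMessage)
                .append("\n\nRelevant facts:\n");
        for (String snippet : snippets) {
            prompt.append("- ").append(snippet).append("\n");
        }
        return prompt.toString();
    }

    public static void main(String[] args) {
        String prompt = inject("What is the refund window?",
                List.of("Refunds are accepted within 30 days."));
        System.out.println(prompt);
    }
}
```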

6. Final Assembly

@Produces
@ApplicationScoped
public RetrievalAugmentor createRetrievalAugmentor(
        ChatModel chatModel, // Default chat model, you can also use @ModelName to select it
        ScoringModel scoringModel) {
    return DefaultRetrievalAugmentor.builder()
            .queryTransformer(getDefaultQueryTransformer(chatModel))
            .queryRouter(createQueryRouter(chatModel, getVectorSearchRetriever(), getWebSearchRetriever()))
            .contentAggregator(createAggregator(scoringModel))
            .contentInjector(createContentInjector())
            .build();
}

Best Practices

  • Keep transformer logic lightweight to avoid latency

  • Use metadata filters to constrain search domains (e.g., by language or topic)

  • Ensure retrievers use the same chunking and embedding settings as were used during ingestion

  • Tune scoring thresholds (e.g., minScore) for balance between recall and relevance

  • Implement timeouts or fallbacks for web-based retrievers

  • Write unit tests for injectors to confirm the prompt shape and guard against prompt injection

Summary

Contextual RAG lets you:

  • Dynamically transform queries

  • Combine multiple sources (e.g., vector DB, web search)

  • Select and score content at runtime

  • Customize how prompts are constructed

It adds significant control and accuracy to your RAG workflows in Quarkus LangChain4j.