Document Retrieval for Language Models

Many applications built on Large Language Models (LLMs) require user-specific data that was not part of the model's training set, such as CSV files, data from internal sources, or reports. Retrieval Augmented Generation (RAG) is the technique commonly employed to achieve this.

Understanding the Data Ingestion Journey

Before delving into the RAG process, it’s crucial to delineate two distinct phases:

The Data Ingestion Process

Involves data collection, cleaning, transformation, and adding metadata, resulting in a set of vectorized documents stored in a database.

The RAG Process

Runs before each LLM invocation: it retrieves relevant documents from the database and passes them to the model to augment its context.

[Figure: chatbot architecture]

Unveiling Embeddings

Embeddings are representations of data in a lower-dimensional space (in practice, arrays of floats) that preserve essential characteristics of the original data. In the realm of language, word and text embeddings represent words or passages as numerical vectors in a continuous space.

Why Use Vectors/Embeddings?

During the RAG process, finding the documents relevant to a user query requires computing the similarity (or distance) between each document and the query. Representing both as vectors makes this comparison fast and cheap.
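
To make this concrete, here is a minimal, library-agnostic sketch of cosine similarity, the metric typically used by embedding stores to compare vectors. It is illustrative only and not part of the extension:

package io.quarkiverse.langchain4j.samples;

public class CosineSimilaritySketch {

    /**
     * Cosine similarity between two embedding vectors:
     * dot(a, b) / (||a|| * ||b||). The result is between -1 and 1;
     * the closer to 1, the more similar the two pieces of text.
     */
    static double cosineSimilarity(float[] a, float[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}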

The Ingestion Process

The ingestion process varies based on the data source, operating as a one-shot or continuous process, possibly within a data streaming architecture where each message is treated as a document.

To illustrate document creation, the following code exemplifies creating a Document from a text file:

package io.quarkiverse.langchain4j.samples;

import static dev.langchain4j.data.document.loader.FileSystemDocumentLoader.loadDocument;

import java.io.File;

import dev.langchain4j.data.document.Document;
import dev.langchain4j.data.document.parser.TextDocumentParser;

public class DocumentFromTextCreationExample {

    Document createDocument(File file) {
        return loadDocument(file.toPath(), new TextDocumentParser());
    }
}

A more complex scenario involves creating a Document from a CSV line:

package io.quarkiverse.langchain4j.samples;

import java.util.HashMap;
import java.util.List;
import java.util.Map;

import dev.langchain4j.data.document.Document;
import dev.langchain4j.data.document.Metadata;

public class DocumentCreationExample {

    Document createFromCSVLine(List<String> headers, List<String> line,
            List<String> columnsToIncludeInMetadata) {
        Map<String, String> metadata = new HashMap<>();
        var content = new StringBuilder();
        for (int i = 0; i < headers.size(); i++) {
            var columnName = headers.get(i);
            var value = line.get(i).trim();

            if (columnName.trim().isEmpty()) {
                continue;
            }

            if (columnsToIncludeInMetadata.contains(columnName)) {
                metadata.put(columnName, value);
            }
            // Build a textual representation of the CSV line: key: value, key: value, ...
            content.append(columnName).append(": ").append(value).append(", ");
        }
        // A newline terminates the content, so each CSV line becomes one line in the document
        return new Document(content.append("\n").toString(), Metadata.from(metadata));
    }

}
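
For illustration, here is how the method above could be driven for a whole CSV file. This is a minimal sketch: it assumes a simple comma-separated file without quoted fields, and the "id" metadata column is hypothetical:

package io.quarkiverse.langchain4j.samples;

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

import dev.langchain4j.data.document.Document;

public class CsvFileIngestionSketch {

    List<Document> createFromCsvFile(Path csvFile) throws IOException {
        List<String> lines = Files.readAllLines(csvFile);
        // Naive comma split; a real CSV parser is needed for quoted fields
        List<String> headers = List.of(lines.get(0).split(","));
        List<Document> documents = new ArrayList<>();
        var creator = new DocumentCreationExample();
        for (String line : lines.subList(1, lines.size())) {
            documents.add(creator.createFromCSVLine(
                    headers, List.of(line.split(",")),
                    List.of("id"))); // "id" is a hypothetical metadata column
        }
        return documents;
    }
}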

Following document creation, the documents need to be ingested. The Quarkus LangChain4j extension offers ingestor components for database storage. For instance, quarkus-langchain4j-redis stores data in a Redis database, while quarkus-langchain4j-chroma uses a Chroma database.

The following code demonstrates document ingestion in a Redis database:

package io.quarkiverse.langchain4j.samples;

import static dev.langchain4j.data.document.splitter.DocumentSplitters.recursive;

import java.util.List;

import jakarta.enterprise.context.ApplicationScoped;
import jakarta.inject.Inject;

import dev.langchain4j.data.document.Document;
import dev.langchain4j.model.embedding.EmbeddingModel;
import dev.langchain4j.store.embedding.EmbeddingStoreIngestor;
import io.quarkiverse.langchain4j.redis.RedisEmbeddingStore;

@ApplicationScoped
public class IngestorExampleWithRedis {

    /**
     * The embedding store (the database).
     * The bean is provided by the quarkus-langchain4j-redis extension.
     */
    @Inject
    RedisEmbeddingStore store;

    /**
     * The embedding model (how the vector of a document is computed).
     * The bean is provided by an LLM extension (such as quarkus-langchain4j-openai).
     */
    @Inject
    EmbeddingModel embeddingModel;

    public void ingest(List<Document> documents) {
        EmbeddingStoreIngestor ingestor = EmbeddingStoreIngestor.builder()
                .embeddingStore(store)
                .embeddingModel(embeddingModel)
                .documentSplitter(recursive(500, 0))
                .build();
        // Warning - this can take a long time...
        ingestor.ingest(documents);
    }
}

Adjust the documentSplitter parameter to match the structure of your data. For instance, for documents built from CSV lines terminated by \n (as in the example above), new DocumentByLineSplitter(500, 0) is a recommended starting point.
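
For illustration, only the splitter line of the ingestor above changes (DocumentByLineSplitter comes from the dev.langchain4j.data.document.splitter package):

EmbeddingStoreIngestor ingestor = EmbeddingStoreIngestor.builder()
        .embeddingStore(store)
        .embeddingModel(embeddingModel)
        // Split at line boundaries: segments of at most 500 characters, no overlap
        .documentSplitter(new DocumentByLineSplitter(500, 0))
        .build();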

Retrieval Augmented Generation (RAG)

Once documents are ingested, they can augment the LLM’s capabilities. The following code illustrates the creation of a RetrievalAugmentor:

package io.quarkiverse.langchain4j.samples;

import java.util.function.Supplier;

import jakarta.enterprise.context.ApplicationScoped;

import dev.langchain4j.model.embedding.EmbeddingModel;
import dev.langchain4j.rag.DefaultRetrievalAugmentor;
import dev.langchain4j.rag.RetrievalAugmentor;
import dev.langchain4j.rag.content.retriever.EmbeddingStoreContentRetriever;
import io.quarkiverse.langchain4j.redis.RedisEmbeddingStore;

@ApplicationScoped
public class RetrievalAugmentorExample implements Supplier<RetrievalAugmentor> {

    private final RetrievalAugmentor augmentor;

    RetrievalAugmentorExample(RedisEmbeddingStore store, EmbeddingModel model) {
        EmbeddingStoreContentRetriever contentRetriever = EmbeddingStoreContentRetriever.builder()
                .embeddingModel(model)
                .embeddingStore(store)
                .maxResults(3)
                .build();
        augmentor = DefaultRetrievalAugmentor
                .builder()
                .contentRetriever(contentRetriever)
                .build();
    }

    @Override
    public RetrievalAugmentor get() {
        return augmentor;
    }

}

This is the simplest example of a retrieval augmentor: it uses only an EmbeddingStoreContentRetriever to retrieve documents from an embedding store and passes them directly to the LLM. A retrieval augmentor can apply more sophisticated strategies to process queries, such as query compression, splitting a query into multiple queries and routing them through different content retrievers (which may or may not be based on vector storage; a full-text search engine is another option), using a scoring model to further filter the retrieved results, and so on. For more information about advanced RAG strategies, refer to https://docs.langchain4j.dev/tutorials/rag/.

The example above is a CDI bean that implements Supplier<RetrievalAugmentor>. An alternative way to wire things up is to create a CDI bean that directly implements RetrievalAugmentor, for example via a CDI producer, and let Quarkus auto-discover it (by not specifying the retrievalAugmentor parameter of the @RegisterAiService annotation).
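
As a sketch of that alternative (the class name is illustrative), a CDI producer could expose the RetrievalAugmentor directly:

package io.quarkiverse.langchain4j.samples;

import jakarta.enterprise.context.ApplicationScoped;
import jakarta.enterprise.inject.Produces;

import dev.langchain4j.model.embedding.EmbeddingModel;
import dev.langchain4j.rag.DefaultRetrievalAugmentor;
import dev.langchain4j.rag.RetrievalAugmentor;
import dev.langchain4j.rag.content.retriever.EmbeddingStoreContentRetriever;
import io.quarkiverse.langchain4j.redis.RedisEmbeddingStore;

public class RetrievalAugmentorProducerExample {

    @Produces
    @ApplicationScoped
    public RetrievalAugmentor augmentor(RedisEmbeddingStore store, EmbeddingModel model) {
        return DefaultRetrievalAugmentor.builder()
                .contentRetriever(EmbeddingStoreContentRetriever.builder()
                        .embeddingStore(store)
                        .embeddingModel(model)
                        .maxResults(3)
                        .build())
                .build();
    }
}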

The EmbeddingStoreContentRetriever requires a configured embedding store (Redis, Chroma, etc.) and an embedding model. Configure the maximum number of documents to retrieve (3 in the example above) and, if needed, a minimum relevance score below which matches are discarded.
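
For example, the retriever above could be made stricter with a minimum relevance score (0.7 is an arbitrary illustrative threshold; tune it for your data):

EmbeddingStoreContentRetriever contentRetriever = EmbeddingStoreContentRetriever.builder()
        .embeddingModel(model)
        .embeddingStore(store)
        .maxResults(3)
        // Drop matches whose relevance score falls below the threshold
        .minScore(0.7)
        .build();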

Make sure that the number of retrieved documents is not too high and that the documents themselves are not too large. The more documents you retrieve, the more data you add to the LLM context, and the more likely you are to exceed its token limit.

An AI service does not use a retrieval augmentor by default; one must be configured explicitly via the retrievalAugmentor property of @RegisterAiService, and the configured Supplier<RetrievalAugmentor> is expected to be a CDI bean.
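
For example, an AI service using the RetrievalAugmentorExample supplier shown earlier could be declared as follows (the interface name and method are illustrative):

package io.quarkiverse.langchain4j.samples;

import io.quarkiverse.langchain4j.RegisterAiService;

@RegisterAiService(retrievalAugmentor = RetrievalAugmentorExample.class)
public interface AssistantWithRagExample {

    // Each call is augmented with documents retrieved from the embedding store
    String chat(String question);
}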