Loading CSV files

When implementing Retrieval-Augmented Generation (RAG), you often need to load tabular data, such as CSV files. This guide provides recommendations for loading CSV files in a way that works well with RAG.

Loading a CSV file for RAG involves three steps:

  1. Transforming each row into a document.

  2. Ingesting the set of documents using an appropriate document splitter.

  3. Storing the documents in an embedding store.

You can find a complete example in the GitHub repository.
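Throughout this guide, we will assume a hypothetical movies.csv file; the file name and columns are only for illustration, and any tabular data works the same way:

title,year,genre
The Matrix,1999,Science Fiction
Spirited Away,2001,Animation
Parasite,2019,Thriller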

From CSV to Documents

There are multiple ways to load CSV files in Java. In this example, we use Apache Commons CSV:

<dependency>
    <groupId>org.apache.commons</groupId>
    <artifactId>commons-csv</artifactId>
    <version>1.10.0</version>
</dependency>

You can choose a different library; the APIs are similar enough.
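For instance, if you want to avoid the extra dependency altogether, here is a minimal JDK-only sketch that produces the same row data. It assumes simple rows without quoted or escaped commas (which Commons CSV handles for you) and uses the file and headers fields introduced below:

// Naive CSV parsing with the JDK only; does not handle quoted fields.
List<Map<String, String>> rows = new ArrayList<>();
try (BufferedReader reader = Files.newBufferedReader(file.toPath())) {
    reader.readLine(); // Skip the header line.
    String line;
    while ((line = reader.readLine()) != null) {
        String[] values = line.split(",");
        Map<String, String> row = new HashMap<>();
        for (int i = 0; i < headers.size() && i < values.length; i++) {
            row.put(headers.get(i), values[i]);
        }
        rows.add(row);
    }
}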

Once you have the dependency, load the CSV and process the rows:

/**
 * The CSV file to load.
 */
@ConfigProperty(name = "csv.file")
File file;

/**
 * The CSV file headers.
 * Some libraries provide an API to extract them.
 */
@ConfigProperty(name = "csv.headers")
List<String> headers;

/**
 * Ingest the CSV file.
 * This method is executed when the application starts.
 */
public void ingest(@Observes StartupEvent event) throws IOException {
    // Configure the CSV format.
    CSVFormat csvFormat = CSVFormat.DEFAULT.builder()
            .setHeader(headers.toArray(new String[0]))
            .setSkipHeaderRecord(true)
            .build();
    // This will be the resulting list of documents:
    List<Document> documents = new ArrayList<>();

    try (Reader reader = new FileReader(file)) {
        // Generate one document per row, using the configured format.
        Iterable<CSVRecord> records = csvFormat.parse(reader);
        int i = 1;
        for (CSVRecord record : records) {
            Map<String, String> metadata = new HashMap<>();
            metadata.put("source", file.getAbsolutePath());
            metadata.put("row", String.valueOf(i++));

            StringBuilder content = new StringBuilder();
            for (String header : headers) {
                // Add each column value to the metadata and to the document content.
                metadata.put(header, record.get(header));
                content.append(header).append(": ").append(record.get(header)).append("\n");
            }
            documents.add(new Document(content.toString(), Metadata.from(metadata)));
        }
        // ... (ingestion continues in the next section)
    }
}
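The file and headers fields are populated from the Quarkus configuration. With the hypothetical movies.csv from the introduction, the corresponding application.properties entries could look like this (Quarkus converts the comma-separated csv.headers value into the injected List<String>):

csv.file=data/movies.csv
csv.headers=title,year,genre

With this setup, the first row produces a document whose text is the three lines title: The Matrix, year: 1999, and genre: Science Fiction, with the same values duplicated into its metadata alongside the source and row entries.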

Ingesting the Documents

Once you have the list of documents, they need to be ingested. For this, use a document splitter. We recommend the recursive splitter, a simple splitter that divides each document into chunks of a given size. While it may not be the most suitable splitter for every use case, it is a good starting point.

// 'recursive' is statically imported from
// dev.langchain4j.data.document.splitter.DocumentSplitters.
var ingestor = EmbeddingStoreIngestor.builder()
        .embeddingStore(store) // Injected
        .embeddingModel(embeddingModel) // Injected
        // Chunks of at most 500 characters, with no overlap.
        .documentSplitter(recursive(500, 0))
        .build();
ingestor.ingest(documents);
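If chunk boundaries cut through related text, you can let adjacent chunks overlap; here is a sketch with a 50-character overlap, which differs from the snippet above only in the second argument:

var ingestor = EmbeddingStoreIngestor.builder()
        .embeddingStore(store)
        .embeddingModel(embeddingModel)
        // Adjacent segments share up to 50 characters of context.
        .documentSplitter(recursive(500, 50))
        .build();
ingestor.ingest(documents);

Note that with row-per-document ingestion, rows shorter than 500 characters pass through the splitter unchanged, so the overlap mainly matters for long text columns.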

Implementing the Retriever

With the documents ingested, you can now implement the retriever:

package io.quarkiverse.langchain4j.sample.chatbot;

import java.util.List;

import jakarta.enterprise.context.ApplicationScoped;

import dev.langchain4j.data.segment.TextSegment;
import dev.langchain4j.model.embedding.EmbeddingModel;
import dev.langchain4j.retriever.EmbeddingStoreRetriever;
import dev.langchain4j.retriever.Retriever;
import io.quarkiverse.langchain4j.redis.RedisEmbeddingStore;

@ApplicationScoped
public class RetrieverExample implements Retriever<TextSegment> {

    private final EmbeddingStoreRetriever retriever;

    RetrieverExample(RedisEmbeddingStore store, EmbeddingModel model) {
        // Limit the number of retrieved segments to avoid exceeding the model's context window.
        retriever = EmbeddingStoreRetriever.from(store, model, 10);
    }

    @Override
    public List<TextSegment> findRelevant(String s) {
        return retriever.findRelevant(s);
    }
}
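To put the retriever to work, reference it from an AI service. In quarkus-langchain4j versions that expose this Retriever API, the @RegisterAiService annotation accepts a retriever attribute; the following is a minimal sketch, where the Bot interface name and the prompt are illustrative:

package io.quarkiverse.langchain4j.sample.chatbot;

import dev.langchain4j.service.SystemMessage;
import dev.langchain4j.service.UserMessage;
import io.quarkiverse.langchain4j.RegisterAiService;

@RegisterAiService(retriever = RetrieverExample.class)
public interface Bot {

    @SystemMessage("Answer using only the content retrieved from the CSV file.")
    String chat(@UserMessage String question);
}

Each question is then embedded, the ten most similar segments are fetched from the store, and they are passed to the model as context.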