Loading CSV files
When implementing Retrieval-Augmented Generation (RAG), it is often necessary to load tabular data, such as a CSV file. This guide provides recommendations for loading CSV files in a way that is compatible with a RAG pipeline.
When loading a CSV file, the process involves:
- Transforming each row into a document.
- Ingesting the set of documents using an appropriate document splitter.
- Storing the documents in the database.
You can find a complete example in the GitHub Repository.
From CSV to Documents
There are multiple ways to load CSV files in Java. In this example, we use the following dependencies:
<dependency>
    <groupId>org.apache.commons</groupId>
    <artifactId>commons-csv</artifactId>
    <version>1.10.0</version>
</dependency>
You can choose a different library; the APIs are similar enough.
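For instance, here is a minimal sketch of the same row-reading loop using OpenCSV instead (an illustration only, assuming the com.opencsv:opencsv dependency, version 5.x):

try (CSVReader csv = new CSVReader(new FileReader(file))) {
    String[] headerRow = csv.readNext(); // first record: the column headers
    String[] row;
    while ((row = csv.readNext()) != null) {
        // Build one document per row, as shown below for Commons CSV.
    }
} catch (CsvValidationException e) {
    throw new IOException("Invalid CSV file", e);
}

The rest of this guide uses Commons CSV.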
Once you have the dependency, load the CSV and process the rows:
/**
 * The CSV file to load.
 */
@ConfigProperty(name = "csv.file")
File file;

/**
 * The CSV file headers.
 * Some libraries provide an API to extract them.
 */
@ConfigProperty(name = "csv.headers")
List<String> headers;

/**
 * Ingest the CSV file.
 * This method is executed when the application starts.
 */
public void ingest(@Observes StartupEvent event) throws IOException {
    // Configure the CSV format.
    CSVFormat csvFormat = CSVFormat.DEFAULT.builder()
            .setHeader(headers.toArray(new String[0]))
            .setSkipHeaderRecord(true)
            .build();

    // This will be the resulting list of documents:
    List<Document> documents = new ArrayList<>();
    try (Reader reader = new FileReader(file)) {
        // Generate one document per row, using the configured format.
        Iterable<CSVRecord> records = csvFormat.parse(reader);
        int i = 1;
        for (CSVRecord record : records) {
            Map<String, String> metadata = new HashMap<>();
            metadata.put("source", file.getAbsolutePath());
            metadata.put("row", String.valueOf(i++));

            StringBuilder content = new StringBuilder();
            for (String header : headers) {
                // Include each column both in the metadata and in the document content.
                metadata.put(header, record.get(header));
                content.append(header).append(": ").append(record.get(header)).append("\n");
            }
            documents.add(new Document(content.toString(), Metadata.from(metadata)));
        }
        // ...
    }
}
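To make the output concrete: for a hypothetical CSV with headers name and genre, the row Dune,science-fiction produces a document whose text is:

name: Dune
genre: science-fiction

and whose metadata contains source, row, name, and genre. Keeping each column in the metadata lets you filter or attribute retrieved chunks later.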
Ingesting the Documents
Once you have the list of documents, they need to be ingested. For this, use a document splitter. We recommend the recursive
splitter, a simple splitter that divides each document into chunks of a given size. While it may not be the most suitable splitter for your use case, it serves as a good starting point.
var ingestor = EmbeddingStoreIngestor.builder()
        .embeddingStore(store)          // Injected
        .embeddingModel(embeddingModel) // Injected
        .documentSplitter(recursive(500, 0))
        .build();
ingestor.ingest(documents);
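The recursive(500, 0) call is the DocumentSplitters.recursive static factory from LangChain4j: it splits each document into chunks of at most 500 characters with no overlap between consecutive chunks. It is typically brought in with a static import:

import static dev.langchain4j.data.document.splitter.DocumentSplitters.recursive;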
Implementing the Retrieval Augmentor
With the documents ingested, you can now implement the retrieval augmentor:
package io.quarkiverse.langchain4j.sample.chatbot;

import java.util.function.Supplier;

import jakarta.enterprise.context.ApplicationScoped;

import dev.langchain4j.model.embedding.EmbeddingModel;
import dev.langchain4j.rag.DefaultRetrievalAugmentor;
import dev.langchain4j.rag.RetrievalAugmentor;
import dev.langchain4j.rag.content.retriever.EmbeddingStoreContentRetriever;
import dev.langchain4j.store.embedding.EmbeddingStore;

@ApplicationScoped
public class AugmentorExample implements Supplier<RetrievalAugmentor> {

    private final EmbeddingStoreContentRetriever retriever;

    AugmentorExample(EmbeddingStore store, EmbeddingModel model) {
        retriever = EmbeddingStoreContentRetriever.builder()
                .embeddingModel(model)
                .embeddingStore(store)
                .maxResults(10)
                .build();
    }

    @Override
    public RetrievalAugmentor get() {
        return DefaultRetrievalAugmentor.builder()
                .contentRetriever(retriever)
                .build();
    }
}
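To put the augmentor to work, reference it from an AI service. A minimal sketch, assuming a hypothetical ChatBot interface (the retrievalAugmentor attribute of @RegisterAiService takes a class implementing Supplier<RetrievalAugmentor>):

package io.quarkiverse.langchain4j.sample.chatbot;

import dev.langchain4j.service.UserMessage;
import io.quarkiverse.langchain4j.RegisterAiService;

@RegisterAiService(retrievalAugmentor = AugmentorExample.class)
public interface ChatBot {
    String chat(@UserMessage String question);
}

On each chat call, the augmentor retrieves the 10 most relevant document chunks from the embedding store and injects them into the user message before it is sent to the model.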