Chroma Document Store for Retrieval Augmented Generation (RAG)
When implementing Retrieval Augmented Generation (RAG), a robust document store is crucial. This guide demonstrates how to leverage a Chroma database as the document store.
Leveraging the Chroma Document Store
To make use of the Chroma document store, you’ll need to include the following dependency:
<dependency>
    <groupId>io.quarkiverse.langchain4j</groupId>
    <artifactId>quarkus-langchain4j-chroma</artifactId>
    <version>0.23.0.CR1</version>
</dependency>
This extension includes a dev service. If a container runtime such as Docker is running, a Chroma instance starts automatically in dev and test mode.
Once the extension is added, you can use the Chroma document store with the following code:
package io.quarkiverse.langchain4j.samples;

import static dev.langchain4j.data.document.splitter.DocumentSplitters.recursive;

import java.util.List;

import jakarta.enterprise.context.ApplicationScoped;
import jakarta.inject.Inject;

import dev.langchain4j.data.document.Document;
import dev.langchain4j.model.embedding.EmbeddingModel;
import dev.langchain4j.store.embedding.EmbeddingStoreIngestor;
import io.quarkiverse.langchain4j.chroma.ChromaEmbeddingStore;

@ApplicationScoped
public class IngestorExampleWithChroma {

    /**
     * The embedding store (the database).
     * The bean is provided by the quarkus-langchain4j-chroma extension.
     */
    @Inject
    ChromaEmbeddingStore store;

    /**
     * The embedding model (how the vector of a document is computed).
     * The bean is provided by the LLM (like openai) extension.
     */
    @Inject
    EmbeddingModel embeddingModel;

    public void ingest(List<Document> documents) {
        // Split each document into segments of at most 500 characters with no overlap,
        // embed every segment, and store the resulting vectors in Chroma.
        EmbeddingStoreIngestor ingestor = EmbeddingStoreIngestor.builder()
                .embeddingStore(store)
                .embeddingModel(embeddingModel)
                .documentSplitter(recursive(500, 0))
                .build();
        // Warning - this can take a long time...
        ingestor.ingest(documents);
    }
}
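The two arguments passed to recursive(500, 0) are the maximum segment size in characters and the overlap between consecutive segments. The LangChain4j splitter itself is more careful than this (it tries to cut on paragraph, line, sentence and word boundaries), so the following is only a rough plain-Java sketch of what the two numbers control. NaiveChunker and chunk are hypothetical names used for illustration and are not part of LangChain4j or the extension.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: NOT the LangChain4j splitter, only an illustration of size and overlap. */
final class NaiveChunker {

    private NaiveChunker() {
    }

    /**
     * Cuts text into windows of at most maxSize characters.
     * Consecutive windows share `overlap` characters.
     */
    static List<String> chunk(String text, int maxSize, int overlap) {
        if (maxSize <= 0 || overlap < 0 || overlap >= maxSize) {
            throw new IllegalArgumentException("require 0 <= overlap < maxSize");
        }
        List<String> chunks = new ArrayList<>();
        int step = maxSize - overlap; // how far the window advances each time
        for (int start = 0; start < text.length(); start += step) {
            int end = Math.min(start + maxSize, text.length());
            chunks.add(text.substring(start, end));
            if (end == text.length()) {
                break; // the last window reached the end of the text
            }
        }
        return chunks;
    }
}
```

With a size of 500 and an overlap of 0, a 1,200-character text yields three segments in this sketch. A non-zero overlap repeats the tail of each segment at the start of the next one, which can help retrieval when a relevant sentence straddles a boundary, at the cost of storing more vectors.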
Configuration Settings
The following configuration options customize the behavior of the extension:
Configuration property fixed at build time - All other configuration properties are overridable at runtime
Configuration property | Type | Default
---|---|---
Whether Dev Services has been explicitly enabled or disabled. Dev Services is generally enabled by default, unless an existing configuration is present. When Dev Services is enabled, Quarkus attempts to automatically configure and start a database when running in dev or test mode and when Docker is running. | boolean |
The container image name to use, for container-based Dev Services providers. | string |
Optional fixed port the dev service will listen to. If not defined, the port is chosen randomly. | int |
Indicates whether the Chroma server managed by Quarkus Dev Services is shared. When shared, Quarkus looks for running containers using label-based service discovery. If a matching container is found, it is used, and a second one is not started. Otherwise, Dev Services for Chroma starts a new container. Container sharing is only used in dev mode. | boolean |
The value of the service-discovery label. This property is used when you need multiple shared Chroma servers. | string |
Environment variables that are passed to the container. | Map<String,String> |
URL where the Chroma database is listening for requests. | string | required
The collection name. | string |
The timeout duration for the Chroma client. If not specified, 5 seconds will be used. | Duration |
Whether requests to Chroma should be logged. | boolean |
Whether responses from Chroma should be logged. | boolean |
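Each of these properties can also be supplied as an environment variable. Quarkus follows the MicroProfile Config naming rule: every character that is not alphanumeric or an underscore is replaced by an underscore, and the result is upper-cased. The sketch below applies that rule in plain Java. EnvVarNames and toEnvVar are hypothetical names, and the property name quarkus.langchain4j.chroma.url in the usage note is an assumption used only for illustration.

```java
/** Hypothetical helper illustrating the MicroProfile Config environment-variable naming rule. */
final class EnvVarNames {

    private EnvVarNames() {
    }

    /** Replaces every character outside [A-Za-z0-9_] with '_' and upper-cases the rest. */
    static String toEnvVar(String propertyName) {
        StringBuilder sb = new StringBuilder(propertyName.length());
        for (int i = 0; i < propertyName.length(); i++) {
            char c = propertyName.charAt(i);
            boolean keep = (c >= 'a' && c <= 'z')
                    || (c >= 'A' && c <= 'Z')
                    || (c >= '0' && c <= '9')
                    || c == '_';
            sb.append(keep ? Character.toUpperCase(c) : '_');
        }
        return sb.toString();
    }
}
```

Under this rule, EnvVarNames.toEnvVar("quarkus.langchain4j.chroma.url") returns QUARKUS_LANGCHAIN4J_CHROMA_URL, so dots and dashes in a property name both become underscores.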
About the Duration format
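To write duration values, use the standard java.time.Duration format. You can also use a simplified format, starting with a number: a value that is only a number represents time in seconds, and a number followed by ms represents time in milliseconds.
In other cases, the simplified format is translated to the java.time.Duration format for parsing: a number followed by h, m, or s is prefixed with PT, and a number followed by d is prefixed with P.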
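As a minimal sketch of how a simplified duration value might be translated before being handed to java.time.Duration, the following plain-Java helper handles the common cases. It assumes the usual Quarkus conventions (a bare number means seconds, an ms suffix means milliseconds, other suffixes are prefixed to form an ISO-8601 string) and is an approximation, not the converter Quarkus ships. SimplifiedDuration is a hypothetical name.

```java
import java.time.Duration;

/** Hypothetical helper: an approximation of the simplified duration format, not Quarkus' own converter. */
final class SimplifiedDuration {

    private SimplifiedDuration() {
    }

    static Duration parse(String raw) {
        String value = raw.trim();
        if (value.matches("\\d+")) {
            // Only digits: interpreted as seconds.
            return Duration.ofSeconds(Long.parseLong(value));
        }
        if (value.matches("\\d+ms")) {
            // Digits followed by "ms": interpreted as milliseconds.
            return Duration.ofMillis(Long.parseLong(value.substring(0, value.length() - 2)));
        }
        if (value.matches("\\d+[hmsHMS]")) {
            // Hours, minutes or seconds: prefix with "PT" to obtain an ISO-8601 duration.
            return Duration.parse("PT" + value);
        }
        if (value.matches("\\d+[dD]")) {
            // Days: prefix with "P".
            return Duration.parse("P" + value);
        }
        // Anything else is assumed to already be in the standard java.time.Duration format.
        return Duration.parse(value);
    }
}
```

For the Chroma client timeout, this means a value of 5, 5s, or PT5S would all describe the same five-second duration under these assumptions.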