Chroma Document Store for Retrieval Augmented Generation (RAG)

When implementing Retrieval Augmented Generation (RAG), a robust document store is crucial. This guide demonstrates how to leverage a Chroma database as the document store.

Leveraging the Chroma Document Store

To make use of the Chroma document store, you’ll need to include the following dependency:

<dependency>
    <groupId>io.quarkiverse.langchain4j</groupId>
    <artifactId>quarkus-langchain4j-chroma</artifactId>
    <version>1.0.1</version>
</dependency>

This extension includes a dev service: if a container runtime such as Docker is available, a Chroma instance starts automatically in dev and test mode.
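
If you would rather connect to a Chroma instance you run yourself, the dev service can be switched off. The fragment below is a minimal sketch of an `application.properties` for that case. The property names come from the configuration reference in this guide; the values (the `http://localhost:8000` address, the `my-documents` collection name, and the 10-second timeout) are assumptions chosen for illustration and should be adapted to your setup.

```properties
# Do not start a Chroma container in dev and test mode.
quarkus.langchain4j.chroma.devservices.enabled=false
# Hypothetical address of an externally managed Chroma instance.
quarkus.langchain4j.chroma.url=http://localhost:8000
# Hypothetical collection name (the documented default is "default").
quarkus.langchain4j.chroma.collection-name=my-documents
# Illustrative client timeout (the documented fallback is 5 seconds).
quarkus.langchain4j.chroma.timeout=10s
```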

Once the extension is installed, you can use the Chroma document store as follows:

package io.quarkiverse.langchain4j.samples;

import static dev.langchain4j.data.document.splitter.DocumentSplitters.recursive;

import java.util.List;

import jakarta.enterprise.context.ApplicationScoped;
import jakarta.inject.Inject;

import dev.langchain4j.data.document.Document;
import dev.langchain4j.model.embedding.EmbeddingModel;
import dev.langchain4j.store.embedding.EmbeddingStoreIngestor;
import io.quarkiverse.langchain4j.chroma.ChromaEmbeddingStore;

@ApplicationScoped
public class IngestorExampleWithChroma {

    /**
     * The embedding store (the database).
     * The bean is provided by the quarkus-langchain4j-chroma extension.
     */
    @Inject
    ChromaEmbeddingStore store;

    /**
     * The embedding model (how the vector of a document is computed).
     * The bean is provided by the LLM extension (such as OpenAI).
     */
    @Inject
    EmbeddingModel embeddingModel;

    public void ingest(List<Document> documents) {
        EmbeddingStoreIngestor ingestor = EmbeddingStoreIngestor.builder()
                .embeddingStore(store)
                .embeddingModel(embeddingModel)
                .documentSplitter(recursive(500, 0))
                .build();
        // Warning - this can take a long time...
        ingestor.ingest(documents);
    }
}
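
The `recursive(500, 0)` call above asks for segments of at most 500 characters with no overlap between consecutive segments. To build intuition for those two parameters, here is a simplified, library-free sketch of fixed-size chunking with overlap. It is not the algorithm LangChain4j uses (the recursive splitter tries to break on natural boundaries such as paragraphs, sentences, and words before falling back to characters), and the names `ChunkingSketch` and `split` are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class ChunkingSketch {

    /**
     * Splits text into chunks of at most maxSize characters. Each chunk after
     * the first starts `overlap` characters before the end of the previous one.
     */
    static List<String> split(String text, int maxSize, int overlap) {
        if (maxSize <= 0 || overlap < 0 || overlap >= maxSize) {
            throw new IllegalArgumentException("require maxSize > 0 and 0 <= overlap < maxSize");
        }
        List<String> chunks = new ArrayList<>();
        int step = maxSize - overlap; // how far the window advances each time
        for (int start = 0; start < text.length(); start += step) {
            int end = Math.min(start + maxSize, text.length());
            chunks.add(text.substring(start, end));
            if (end == text.length()) {
                break; // the current window already reached the end of the text
            }
        }
        return chunks;
    }

    public static void main(String[] args) {
        String text = "x".repeat(1200);
        // Same parameters as recursive(500, 0): chunks of 500, 500 and 200 characters.
        for (String chunk : split(text, 500, 0)) {
            System.out.println(chunk.length());
        }
    }
}
```

A non-zero overlap repeats the tail of each chunk at the start of the next one, which can help when a relevant sentence straddles a chunk boundary, at the cost of storing and embedding more text.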

Configuration Settings

The following configuration options customize the behavior of the extension:

The `quarkus.langchain4j.chroma.devservices.*` properties are fixed at build time. All other configuration properties are overridable at runtime.

Configuration property	Type	Default
`quarkus.langchain4j.chroma.devservices.enabled` If DevServices has been explicitly enabled or disabled. DevServices is generally enabled by default, unless there is an existing configuration present. When DevServices is enabled Quarkus will attempt to automatically configure and start a database when running in Dev or Test mode and when Docker is running. Environment variable: `QUARKUS_LANGCHAIN4J_CHROMA_DEVSERVICES_ENABLED`	boolean	`true`
`quarkus.langchain4j.chroma.devservices.image-name` The container image name to use, for container based DevServices providers. Environment variable: `QUARKUS_LANGCHAIN4J_CHROMA_DEVSERVICES_IMAGE_NAME`	string	`ghcr.io/chroma-core/chroma:0.4.15`
`quarkus.langchain4j.chroma.devservices.port` Optional fixed port the dev service will listen to. If not defined, the port will be chosen randomly. Environment variable: `QUARKUS_LANGCHAIN4J_CHROMA_DEVSERVICES_PORT`	int
`quarkus.langchain4j.chroma.devservices.shared` Indicates if the Chroma server managed by Quarkus Dev Services is shared. When shared, Quarkus looks for running containers using label-based service discovery. If a matching container is found, it is used, and so a second one is not started. Otherwise, Dev Services for Chroma starts a new container. The discovery uses the `quarkus-dev-service-chroma` label. The value is configured using the `service-name` property. Container sharing is only used in dev mode. Environment variable: `QUARKUS_LANGCHAIN4J_CHROMA_DEVSERVICES_SHARED`	boolean	`true`
`quarkus.langchain4j.chroma.devservices.service-name` The value of the `quarkus-dev-service-chroma` label attached to the started container. This property is used when `shared` is set to `true`. In this case, before starting a container, Dev Services for Chroma looks for a container with the `quarkus-dev-service-chroma` label set to the configured value. If found, it will use this container instead of starting a new one. Otherwise, it starts a new container with the `quarkus-dev-service-chroma` label set to the specified value. This property is used when you need multiple shared Chroma servers. Environment variable: `QUARKUS_LANGCHAIN4J_CHROMA_DEVSERVICES_SERVICE_NAME`	string	`chroma`
`quarkus.langchain4j.chroma.devservices.container-env."container-env"` Environment variables that are passed to the container. Environment variable: `QUARKUS_LANGCHAIN4J_CHROMA_DEVSERVICES_CONTAINER_ENV__CONTAINER_ENV_`	Map<String,String>
`quarkus.langchain4j.chroma.url` URL where the Chroma database is listening for requests. Environment variable: `QUARKUS_LANGCHAIN4J_CHROMA_URL`	string	required
`quarkus.langchain4j.chroma.collection-name` The collection name. Environment variable: `QUARKUS_LANGCHAIN4J_CHROMA_COLLECTION_NAME`	string	`default`
`quarkus.langchain4j.chroma.timeout` The timeout duration for the Chroma client. If not specified, 5 seconds will be used. Environment variable: `QUARKUS_LANGCHAIN4J_CHROMA_TIMEOUT`	Duration
`quarkus.langchain4j.chroma.log-requests` Whether requests to Chroma should be logged. Environment variable: `QUARKUS_LANGCHAIN4J_CHROMA_LOG_REQUESTS`	boolean	`false`
`quarkus.langchain4j.chroma.log-responses` Whether responses from Chroma should be logged. Environment variable: `QUARKUS_LANGCHAIN4J_CHROMA_LOG_RESPONSES`	boolean	`false`
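
The environment variable names listed in the table appear to follow the MicroProfile Config naming rule that Quarkus applies: every character that is not a letter, a digit, or an underscore is replaced by `_`, and the result is converted to upper case. The sketch below reproduces that mapping so you can derive the variable for any property in the table. The class and method names (`EnvVarNames`, `toEnvVar`) are hypothetical helpers written for illustration, not part of Quarkus.

```java
public class EnvVarNames {

    /** Converts a configuration property name to its environment-variable form. */
    static String toEnvVar(String property) {
        StringBuilder sb = new StringBuilder(property.length());
        for (char c : property.toCharArray()) {
            boolean keep = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')
                    || (c >= '0' && c <= '9') || c == '_';
            // Keep letters, digits and underscores (upper-cased); replace everything else with '_'.
            sb.append(keep ? Character.toUpperCase(c) : '_');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // prints QUARKUS_LANGCHAIN4J_CHROMA_URL
        System.out.println(toEnvVar("quarkus.langchain4j.chroma.url"));
        // prints QUARKUS_LANGCHAIN4J_CHROMA_DEVSERVICES_CONTAINER_ENV__CONTAINER_ENV_
        System.out.println(toEnvVar("quarkus.langchain4j.chroma.devservices.container-env.\"container-env\""));
    }
}
```

The second call shows why the map-valued property ends up with a double and a trailing underscore: the dot and both quotation marks are each replaced by `_`.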
