Weaviate Store for Retrieval Augmented Generation (RAG)

When implementing Retrieval Augmented Generation (RAG), a robust document store is crucial. This guide demonstrates how to leverage a Weaviate database as the document store.

Leveraging the Weaviate embedding store

To use the Weaviate embedding store, add the following dependency:

<dependency>
    <groupId>io.quarkiverse.langchain4j</groupId>
    <artifactId>quarkus-langchain4j-weaviate</artifactId>
</dependency>

This extension includes a dev service. Therefore, if you’re operating in a container environment, a Weaviate instance will automatically start in dev and test mode.

Upon installing the extension, you can use the Weaviate document store with the following code:

package io.quarkiverse.langchain4j.samples;

import static dev.langchain4j.data.document.splitter.DocumentSplitters.recursive;

import java.util.List;

import jakarta.enterprise.context.ApplicationScoped;
import jakarta.inject.Inject;

import dev.langchain4j.data.document.Document;
import dev.langchain4j.model.embedding.EmbeddingModel;
import dev.langchain4j.store.embedding.EmbeddingStoreIngestor;
import dev.langchain4j.store.embedding.weaviate.WeaviateEmbeddingStore;

@ApplicationScoped
public class IngestorExampleWithWeaviate {

    /**
     * The embedding store (the database).
     * The bean is provided by the quarkus-langchain4j-weaviate extension.
     */
    @Inject
    WeaviateEmbeddingStore store;

    /**
     * The embedding model (how the vector of a document is computed).
     * The bean is provided by the LLM (like openai) extension.
     */
    @Inject
    EmbeddingModel embeddingModel;

    public void ingest(List<Document> documents) {
        EmbeddingStoreIngestor ingestor = EmbeddingStoreIngestor.builder()
                .embeddingStore(store)
                .embeddingModel(embeddingModel)
                .documentSplitter(recursive(500, 0))
                .build();
        // Warning - this can take a long time...
        ingestor.ingest(documents);
    }
}

When using Weaviate as an embedding store, you don’t need an in-process embedding model. Weaviate can generate embeddings using its built-in modules or delegate to an external provider, such as OpenAI’s embedding API or other compatible models.
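
The `recursive(500, 0)` call in the ingestor above asks for segments of at most 500 characters with no overlap between consecutive segments. The following is a minimal sketch of what "maximum size" and "overlap" mean, using plain fixed-size windows. It is a simplification for illustration, not the algorithm langchain4j's recursive splitter uses, and the `ChunkSketch` class and its `split` method are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified illustration only: fixed-size character windows,
// not langchain4j's recursive splitter.
final class ChunkSketch {

    /**
     * Splits text into windows of at most maxSize characters.
     * Each window shares `overlap` characters with the previous one.
     */
    static List<String> split(String text, int maxSize, int overlap) {
        if (maxSize <= 0 || overlap < 0 || overlap >= maxSize) {
            throw new IllegalArgumentException("require maxSize > 0 and 0 <= overlap < maxSize");
        }
        List<String> chunks = new ArrayList<>();
        int step = maxSize - overlap; // how far the window advances each time
        for (int start = 0; start < text.length(); start += step) {
            int end = Math.min(start + maxSize, text.length());
            chunks.add(text.substring(start, end));
            if (end == text.length()) {
                break; // the last window reached the end of the text
            }
        }
        return chunks;
    }
}
```

With this sketch, `ChunkSketch.split(text, 500, 0)` on a 1200-character text yields three segments of 500, 500, and 200 characters. A non-zero overlap repeats the tail of one segment at the head of the next, which can help retrieval when a relevant passage straddles a boundary, at the cost of storing more segments.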

To use a remote Weaviate instance, you must also set the host and port. In that case, Dev Services does not start another instance:

quarkus.langchain4j.weaviate.host=localhost
quarkus.langchain4j.weaviate.port=8080
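
These settings can also be supplied through environment variables; the configuration reference lists the variable name for each property. As a rough sketch, the listed names appear to follow the usual MicroProfile Config convention: every character that is not a letter, digit, or underscore becomes an underscore, and the result is upper-cased. The `EnvNameSketch` class and its `toEnvVar` method are hypothetical and only illustrate that mapping; they are not part of the extension.

```java
import java.util.Locale;

// Hypothetical helper illustrating the property-name to environment-variable mapping.
final class EnvNameSketch {

    /** Replaces every non-alphanumeric, non-underscore character with '_' and upper-cases the result. */
    static String toEnvVar(String property) {
        return property.replaceAll("[^A-Za-z0-9_]", "_").toUpperCase(Locale.ROOT);
    }
}
```

For example, `EnvNameSketch.toEnvVar("quarkus.langchain4j.weaviate.host")` returns `QUARKUS_LANGCHAIN4J_WEAVIATE_HOST`. The mapping is one-way: because dots and dashes both become underscores, the property name cannot be reliably recovered from the variable name alone.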

Configuration Settings

Customize the behavior of the extension by exploring various configuration options:

Configuration property fixed at build time - All other configuration properties are overridable at runtime

Each entry below lists, in order: the property description, its environment variable, its type, and its default value (if any).

Whether DevServices is explicitly enabled or disabled. DevServices is generally enabled by default, unless there is an existing configuration present.

When DevServices is enabled Quarkus will attempt to automatically configure and start a database when running in Dev or Test mode and when Docker is running.

Environment variable: QUARKUS_LANGCHAIN4J_WEAVIATE_DEVSERVICES_ENABLED

boolean

true

The container image name to use, for container-based DevServices providers.

Environment variable: QUARKUS_LANGCHAIN4J_WEAVIATE_DEVSERVICES_IMAGE_NAME

string

cr.weaviate.io/semitechnologies/weaviate:1.25.5

Optional fixed port the dev service will listen to.

If not defined, the port will be chosen randomly.

Environment variable: QUARKUS_LANGCHAIN4J_WEAVIATE_DEVSERVICES_PORT

int

Indicates if the Weaviate server managed by Quarkus Dev Services is shared. When shared, Quarkus looks for running containers using label-based service discovery. If a matching container is found, it is used, and so a second one is not started. Otherwise, Dev Services for Weaviate starts a new container.

The discovery uses the quarkus-dev-service-weaviate label. The value is configured using the service-name property.

Container sharing is only used in dev mode.

Environment variable: QUARKUS_LANGCHAIN4J_WEAVIATE_DEVSERVICES_SHARED

boolean

true

The value of the quarkus-dev-service-weaviate label attached to the started container. This property is used when shared is set to true. In this case, before starting a container, Dev Services for Weaviate looks for a container with the quarkus-dev-service-weaviate label set to the configured value. If found, it will use this container instead of starting a new one. Otherwise, it starts a new container with the quarkus-dev-service-weaviate label set to the specified value.

This property is used when you need multiple shared Weaviate servers.

Environment variable: QUARKUS_LANGCHAIN4J_WEAVIATE_DEVSERVICES_SERVICE_NAME

string

weaviate

Environment variables that are passed to the container.

Environment variable: QUARKUS_LANGCHAIN4J_WEAVIATE_DEVSERVICES_CONTAINER_ENV__CONTAINER_ENV_

Map<String,String>

The Weaviate API key to authenticate with.

Environment variable: QUARKUS_LANGCHAIN4J_WEAVIATE_API_KEY

string

The scheme of the cluster URL, e.g. "https". Find it under Details of your Weaviate cluster.

Environment variable: QUARKUS_LANGCHAIN4J_WEAVIATE_SCHEME

string

http

The host name of the Weaviate server.

Environment variable: QUARKUS_LANGCHAIN4J_WEAVIATE_HOST

string

localhost

The HTTP port of the Weaviate server. Defaults to 8080.

Environment variable: QUARKUS_LANGCHAIN4J_WEAVIATE_PORT

int

8080

The gRPC port of the Weaviate server. Defaults to 50051

Environment variable: QUARKUS_LANGCHAIN4J_WEAVIATE_GRPC_PORT

int

50051

The gRPC connection is secured.

Environment variable: QUARKUS_LANGCHAIN4J_WEAVIATE_GRPC_SECURE

boolean

false

Use gRPC instead of HTTP for batch inserts only. HTTP is still used for search.

Environment variable: QUARKUS_LANGCHAIN4J_WEAVIATE_GRPC_USE_FOR_INSERTS

boolean

false

The object class you want to store, e.g. "MyGreatClass". Must start with an uppercase letter.

Environment variable: QUARKUS_LANGCHAIN4J_WEAVIATE_OBJECT_CLASS

string

Default

The name of the field that contains the text of a TextSegment. Default is "text"

Environment variable: QUARKUS_LANGCHAIN4J_WEAVIATE_TEXT_FIELD_NAME

string

text

If true, WeaviateEmbeddingStore generates a hashed ID based on the provided text segment, which avoids duplicate entries in the database. If false, a random ID is generated.

Environment variable: QUARKUS_LANGCHAIN4J_WEAVIATE_AVOID_DUPS

boolean

false

Consistency level: ONE, QUORUM (default) or ALL.

Environment variable: QUARKUS_LANGCHAIN4J_WEAVIATE_CONSISTENCY_LEVEL

one, quorum, all

quorum

Metadata keys that should be persisted. The default in Weaviate is [], but at least one key must be specified for the EmbeddingStore to work, so "tags" is used as the default.

Environment variable: QUARKUS_LANGCHAIN4J_WEAVIATE_METADATA_KEYS

list of string

tags

The name of the field where Metadata entries are stored

Environment variable: QUARKUS_LANGCHAIN4J_WEAVIATE_METADATA_FIELD_NAME

string

_metadata
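
The duplicate-avoidance setting described above relies on deriving the object ID from the text segment, so that ingesting the same text twice targets the same ID instead of creating a second entry. The sketch below illustrates the idea with Java's name-based UUIDs as a stand-in. The hashing scheme actually used by WeaviateEmbeddingStore is not specified in this guide, and the `IdSketch` class and its `idFor` method are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.UUID;

// Hypothetical illustration of content-derived versus random IDs.
final class IdSketch {

    /**
     * Returns a deterministic ID derived from the text when avoidDups is true,
     * and a fresh random ID otherwise.
     */
    static String idFor(String text, boolean avoidDups) {
        if (avoidDups) {
            // Same text always yields the same UUID (assumption: stand-in for the store's own hash).
            return UUID.nameUUIDFromBytes(text.getBytes(StandardCharsets.UTF_8)).toString();
        }
        // Each call yields a new UUID, so repeated ingestion creates new entries.
        return UUID.randomUUID().toString();
    }
}
```

With a content-derived ID, re-running ingestion over unchanged documents is effectively idempotent. With random IDs, each run adds another copy of every segment.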