PGVector Document Store

The PGVector extension allows you to use PostgreSQL as a vector database for Retrieval-Augmented Generation (RAG) with Quarkus LangChain4j. It leverages the pgvector extension in PostgreSQL to store and search vector embeddings efficiently.

Prerequisites

To use PGVector as a document store:

A PostgreSQL instance with the pgvector extension installed is required.
A Quarkus datasource must be configured.
The embedding vector dimension must match your embedding model.

PGVector is a native PostgreSQL extension that adds vector similarity search capabilities to PostgreSQL. It supports L2, cosine, and inner-product distance metrics.

In dev mode, the quarkus-langchain4j-pgvector extension will automatically start a PostgreSQL instance with the pgvector extension enabled.

Dependency

To enable PGVector integration in your Quarkus project, add the following Maven dependency:

<dependency>
    <groupId>io.quarkiverse.langchain4j</groupId>
    <artifactId>quarkus-langchain4j-pgvector</artifactId>
    <version>1.5.0.CR2</version>
</dependency>

This extension requires a configured Quarkus datasource. For configuration details, refer to the Quarkus DataSource Guide.

Embedding Dimension

You must explicitly configure the dimensionality of the embedding vector:

quarkus.langchain4j.pgvector.dimension=384

This value depends on the embedding model in use:

AllMiniLmL6V2QuantizedEmbeddingModel → 384
OpenAI text-embedding-ada-002 → 1536

If the embedding dimension is missing or mismatched, ingestion and retrieval will fail or produce inaccurate results.

If you switch to a different embedding model, ensure the dimension value is updated accordingly.
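
A dimension mismatch is cheapest to catch before anything is written to the table. The following is a minimal sketch, not part of the extension: DimensionCheck and requireDimension are hypothetical names, and it assumes you already have a sample vector as a float[] (for example, obtained by embedding a short test string with your model) together with the value you configured for quarkus.langchain4j.pgvector.dimension.

```java
// Hypothetical helper for illustration only; the extension does not ship this class.
final class DimensionCheck {

    private DimensionCheck() {
    }

    /**
     * Fails fast when a vector produced by the embedding model does not have
     * the dimension configured for the embeddings table.
     */
    static void requireDimension(float[] sampleVector, int configuredDimension) {
        if (sampleVector.length != configuredDimension) {
            throw new IllegalStateException(
                    "Embedding model produces vectors of dimension " + sampleVector.length
                            + " but quarkus.langchain4j.pgvector.dimension is " + configuredDimension);
        }
    }

    public static void main(String[] args) {
        // A 384-element array stands in for a vector from AllMiniLmL6V2QuantizedEmbeddingModel.
        requireDimension(new float[384], 384);
        System.out.println("dimension check passed");
    }
}
```

Running such a check once at startup, before the first ingestion, makes a model switch without a matching configuration change visible immediately.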

Embedding Index

The use-index property controls whether the extension creates and uses an index on the embeddings table. When enabled, an IVFFlat index is created, which partitions the vectors into lists (clusters) to speed up nearest-neighbor search. The number of lists is a tuning parameter set through index-list-size: if you set use-index=true, you must also set index-list-size.

You will rarely need this during development.

quarkus.langchain4j.pgvector.use-index=true
quarkus.langchain4j.pgvector.index-list-size=10

A higher number of lists speeds up queries by reducing the search space at query time. However, it also shrinks each region, which can lower recall because some relevant points are excluded. In addition, more distance comparisons are needed to find the closest cluster during the first step of the query.

Here are some recommendations for setting the lists parameter:

For datasets with up to one million rows, use lists = rows / 1000.
For datasets with more than one million rows, use lists = sqrt(rows).
It is generally advisable to have at least 10 clusters.
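
The recommendations above can be turned into a small calculation. This is a sketch only: IvfFlatListSizing and recommendedLists are hypothetical names, the row count is assumed to be known (for example, from a count query on the embeddings table), and the result is a starting point for index-list-size rather than a tuned value.

```java
// Hypothetical helper for illustration only; the extension does not compute this for you.
final class IvfFlatListSizing {

    private static final long ONE_MILLION = 1_000_000L;
    private static final int MIN_LISTS = 10;

    private IvfFlatListSizing() {
    }

    /** Applies the rules above: rows / 1000 up to one million rows, sqrt(rows) beyond, never fewer than 10. */
    static int recommendedLists(long rows) {
        if (rows < 0) {
            throw new IllegalArgumentException("rows must not be negative");
        }
        long lists = rows <= ONE_MILLION
                ? rows / 1000
                : Math.round(Math.sqrt((double) rows));
        return (int) Math.max(MIN_LISTS, lists);
    }

    public static void main(String[] args) {
        for (long rows : new long[] { 5_000L, 200_000L, 4_000_000L }) {
            System.out.println(rows + " rows: index-list-size=" + recommendedLists(rows));
        }
    }
}
```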

Usage Example

Once the extension is installed and configured, you can ingest documents into PGVector using the following code:

package io.quarkiverse.langchain4j.samples;

import static dev.langchain4j.data.document.splitter.DocumentSplitters.recursive;

import java.util.List;

import jakarta.enterprise.context.ApplicationScoped;
import jakarta.inject.Inject;

import dev.langchain4j.data.document.Document;
import dev.langchain4j.model.embedding.EmbeddingModel;
import dev.langchain4j.store.embedding.EmbeddingStoreIngestor;
import io.quarkiverse.langchain4j.pgvector.PgVectorEmbeddingStore;

@ApplicationScoped
public class IngestorExampleWithPgvector {

    /**
     * The embedding store (the database).
     * The bean is provided by the quarkus-langchain4j-pgvector extension.
     */
    @Inject
    PgVectorEmbeddingStore store;

    /**
     * The embedding model (computes the vector for each document segment).
     * The bean is provided by the LLM (like openai) extension.
     */
    @Inject
    EmbeddingModel embeddingModel;

    public void ingest(List<Document> documents) {
        EmbeddingStoreIngestor ingestor = EmbeddingStoreIngestor.builder()
                .embeddingStore(store)
                .embeddingModel(embeddingModel)
                .documentSplitter(recursive(500, 0))
                .build();
        // Warning - this can take a long time...
        ingestor.ingest(documents);
    }
}

This example shows how to embed and persist documents using the PGVector store, enabling efficient similarity search during RAG queries.

Configuration

Customize the behavior of the extension using the following configuration options:

Configuration property fixed at build time - All other configuration properties are overridable at runtime

Configuration property	Type	Default
`quarkus.langchain4j.pgvector.datasource` The name of the configured Postgres datasource to use for this store. If not set, the default datasource from the Agroal extension will be used. Environment variable: `QUARKUS_LANGCHAIN4J_PGVECTOR_DATASOURCE`	string
`quarkus.langchain4j.pgvector.table` The table name for storing embeddings Environment variable: `QUARKUS_LANGCHAIN4J_PGVECTOR_TABLE`	string	`embeddings`
`quarkus.langchain4j.pgvector.dimension` The dimension of the embedding vectors. This has to be the same as the dimension of vectors produced by the embedding model that you use. For example, AllMiniLmL6V2QuantizedEmbeddingModel produces vectors of dimension 384. OpenAI’s text-embedding-ada-002 produces vectors of dimension 1536. Environment variable: `QUARKUS_LANGCHAIN4J_PGVECTOR_DIMENSION`	int	required
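`quarkus.langchain4j.pgvector.use-index` Whether to create and use an IVFFlat index on the embeddings table. Environment variable: `QUARKUS_LANGCHAIN4J_PGVECTOR_USE_INDEX`	boolean	`false`
`quarkus.langchain4j.pgvector.index-list-size` The number of lists (clusters) used by the IVFFlat index. Must be set when `use-index` is true. Environment variable: `QUARKUS_LANGCHAIN4J_PGVECTOR_INDEX_LIST_SIZE`	int	`0`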
`quarkus.langchain4j.pgvector.create-table` Whether the table should be created if not already existing. Environment variable: `QUARKUS_LANGCHAIN4J_PGVECTOR_CREATE_TABLE`	boolean	`true`
`quarkus.langchain4j.pgvector.drop-table-first` Whether the table should be dropped prior to being created. Environment variable: `QUARKUS_LANGCHAIN4J_PGVECTOR_DROP_TABLE_FIRST`	boolean	`false`
`quarkus.langchain4j.pgvector.register-vector-pg-extension` Whether the pgvector PostgreSQL extension should be created on startup. By default, in dev or test mode the value is overridden to true. Environment variable: `QUARKUS_LANGCHAIN4J_PGVECTOR_REGISTER_VECTOR_PG_EXTENSION`	boolean	`false`
`quarkus.langchain4j.pgvector.metadata.storage-mode` Metadata storage mode. COLUMN_PER_KEY: for static metadata, when you know the list of metadata fields in advance. In this case, also override the `quarkus.langchain4j.pgvector.metadata.column-definitions` property to define the right columns. COMBINED_JSON: for dynamic metadata, when you don’t know the list of metadata fields that will be used. COMBINED_JSONB: same as COMBINED_JSON, but stored in a binary format and optimized for queries on large datasets. In this case, also override the `quarkus.langchain4j.pgvector.metadata.column-definitions` property to change the type of the `metadata` column to JSONB. Default value: COMBINED_JSON Environment variable: `QUARKUS_LANGCHAIN4J_PGVECTOR_METADATA_STORAGE_MODE`	`column-per-key`, `combined-json`, `combined-jsonb`	`combined-json`
`quarkus.langchain4j.pgvector.metadata.column-definitions` SQL definition of the metadata column(s). The default, `metadata JSON NULL`, is only suitable for the COMBINED_JSON storage mode. With COMBINED_JSONB, this should in most cases be set to `metadata JSONB NULL`. With COLUMN_PER_KEY, this should be a list of columns, one for each desired metadata field. Example: condominium_id uuid null, user uuid null Environment variable: `QUARKUS_LANGCHAIN4J_PGVECTOR_METADATA_COLUMN_DEFINITIONS`	list of string	`metadata JSON NULL`
`quarkus.langchain4j.pgvector.metadata.indexes` Metadata indexes: the list of fields to index. For instance: COMBINED_JSON: indexes are not allowed with JSON metadata, so this property must be empty. To use indexes, switch to JSONB metadata. COMBINED_JSONB: (metadata->'key'), (metadata->'name'), (metadata->'age') COLUMN_PER_KEY: key, name, age Environment variable: `QUARKUS_LANGCHAIN4J_PGVECTOR_METADATA_INDEXES`	list of string
`quarkus.langchain4j.pgvector.metadata.index-type` Index Type: BTREE (default) GIN Other PostgreSQL index types Environment variable: `QUARKUS_LANGCHAIN4J_PGVECTOR_METADATA_INDEX_TYPE`	string	`BTREE`
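
Each environment variable name in the table follows from its property name: every character that is not a letter, digit, or underscore becomes an underscore, and the result is upper-cased. The sketch below reproduces that mapping for illustration. EnvVarNames and toEnvVar are hypothetical names; Quarkus performs this lookup itself, so you never need to call such a helper.

```java
import java.util.Locale;

// Hypothetical helper for illustration only; Quarkus resolves environment variables on its own.
final class EnvVarNames {

    private EnvVarNames() {
    }

    /** Replaces every character outside [A-Za-z0-9_] with '_' and upper-cases the result. */
    static String toEnvVar(String propertyName) {
        return propertyName.replaceAll("[^A-Za-z0-9_]", "_").toUpperCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(toEnvVar("quarkus.langchain4j.pgvector.use-index"));
    }
}
```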

How It Works

The PGVector extension maps each ingested document segment to a row in a PostgreSQL table. Each row contains:

The original text content
Optional metadata
The vector embedding (stored as a vector type column)

During retrieval, a similarity search (e.g., cosine distance) is performed using a SELECT query with ORDER BY embedding <=> :query_vector LIMIT N.
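
To make the ordering concrete, the sketch below mimics in memory what such a query does: it computes the cosine distance (1 minus the cosine similarity) between the query vector and each stored vector, sorts by that distance, and keeps the first N. It is an illustration under simplifying assumptions, not the extension's implementation: CosineDistanceSketch and its methods are hypothetical names, the rows are plain float[] arrays, and no index is involved.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical in-memory illustration; in the real store PostgreSQL performs this work.
final class CosineDistanceSketch {

    private CosineDistanceSketch() {
    }

    /** Cosine distance: 1 - (a . b) / (|a| * |b|). The result is NaN if either vector is all zeros. */
    static double cosineDistance(float[] a, float[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("vectors must have the same dimension");
        }
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return 1.0 - dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    /** Returns the positions of the n rows closest to the query, nearest first (ORDER BY distance LIMIT n). */
    static List<Integer> nearest(List<float[]> rows, float[] query, int n) {
        List<Integer> positions = new ArrayList<>();
        for (int i = 0; i < rows.size(); i++) {
            positions.add(i);
        }
        positions.sort(Comparator.comparingDouble(i -> cosineDistance(rows.get(i), query)));
        return new ArrayList<>(positions.subList(0, Math.min(n, positions.size())));
    }

    public static void main(String[] args) {
        List<float[]> rows = List.of(
                new float[] { 1f, 0f },
                new float[] { 0f, 1f },
                new float[] { 1f, 1f });
        System.out.println(nearest(rows, new float[] { 1f, 0f }, 2));
    }
}
```

This exhaustive scan corresponds to querying without an index; an IVFFlat index trades some recall for speed by comparing only against vectors in the closest lists.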

The extension manages schema creation and indexing automatically unless overridden.

Summary

To use PostgreSQL and PGVector as a document store with Quarkus LangChain4j:

Ensure the pgvector extension is installed in your PostgreSQL instance.
Add the extension dependency.
Configure a datasource and set the correct embedding dimension.
Use PgVectorEmbeddingStore to ingest and retrieve embedded documents.