Ollama Embedding Models

Ollama supports embedding models suitable for semantic search, document retrieval, and RAG-style workflows. These models run locally, just like chat models.

Prerequisites

Ollama Installation

To use embedding models, you must have a working Ollama setup. Refer to Ollama Chat Models for details on installation and Dev Service support.

Enabling Ollama

To enable embedding support, include the following extension:

<dependency>
    <groupId>io.quarkiverse.langchain4j</groupId>
    <artifactId>quarkus-langchain4j-ollama</artifactId>
    <version>1.0.2</version>
</dependency>

Default Model

By default, the embedding model is set to nomic-embed-text.

You can override this using:

quarkus.langchain4j.ollama.embedding-model.model-name=bge-m3

You may also wish to configure logging during development:

quarkus.langchain4j.log-requests=true
quarkus.langchain4j.log-responses=true

Programmatic Usage

You can inject the embedding model directly:

@Inject EmbeddingModel model;

This will retrieve the embedding model configured in application.properties.

Dynamic Authorization

To provide dynamic authorization headers, implement ModelAuthProvider:

import io.quarkiverse.langchain4j.auth.ModelAuthProvider;
import jakarta.enterprise.context.ApplicationScoped;

@ApplicationScoped
public class MyAuthProvider implements ModelAuthProvider {

    @Override
    public String getAuthorization(Input input) {
        return "Bearer " + fetchToken();
    }
}

Configuration Reference

Configuration property fixed at build time - All other configuration properties are overridable at runtime

Configuration property

Type

Default

Whether the model should be enabled

Environment variable: QUARKUS_LANGCHAIN4J_OLLAMA_CHAT_MODEL_ENABLED

boolean

true

Whether the model should be enabled

Environment variable: QUARKUS_LANGCHAIN4J_OLLAMA_EMBEDDING_MODEL_ENABLED

boolean

true

If Dev Services for Ollama has been explicitly enabled or disabled. Dev Services are generally enabled by default, unless there is an existing configuration present.

Environment variable: QUARKUS_LANGCHAIN4J_OLLAMA_DEVSERVICES_ENABLED

boolean

true

The Ollama container image to use.

Environment variable: QUARKUS_LANGCHAIN4J_OLLAMA_DEVSERVICES_IMAGE_NAME

string

ollama/ollama:latest

Model to use

Environment variable: QUARKUS_LANGCHAIN4J_OLLAMA_CHAT_MODEL_MODEL_ID

string

llama3.2

Model to use. According to Ollama docs, the default value is nomic-embed-text

Environment variable: QUARKUS_LANGCHAIN4J_OLLAMA_EMBEDDING_MODEL_MODEL_ID

string

nomic-embed-text

Base URL where the Ollama serving is running

Environment variable: QUARKUS_LANGCHAIN4J_OLLAMA_BASE_URL

string

If set, the named TLS configuration with the configured name will be applied to the REST Client

Environment variable: QUARKUS_LANGCHAIN4J_OLLAMA_TLS_CONFIGURATION_NAME

string

Timeout for Ollama calls

Environment variable: QUARKUS_LANGCHAIN4J_OLLAMA_TIMEOUT

Duration 

10s

Whether the Ollama client should log requests

Environment variable: QUARKUS_LANGCHAIN4J_OLLAMA_LOG_REQUESTS

boolean

false

Whether the Ollama client should log responses

Environment variable: QUARKUS_LANGCHAIN4J_OLLAMA_LOG_RESPONSES

boolean

false

Whether to enable the integration. Defaults to true, which means requests are made to the OpenAI provider. Set to false to disable all requests.

Environment variable: QUARKUS_LANGCHAIN4J_OLLAMA_ENABLE_INTEGRATION

boolean

true

The temperature of the model. Increasing the temperature will make the model answer with more variability. A lower temperature will make the model answer more conservatively.

Environment variable: QUARKUS_LANGCHAIN4J_OLLAMA_CHAT_MODEL_TEMPERATURE

double

${quarkus.langchain4j.temperature:0.8}

Maximum number of tokens to predict when generating text

Environment variable: QUARKUS_LANGCHAIN4J_OLLAMA_CHAT_MODEL_NUM_PREDICT

int

Sets the stop sequences to use. When this pattern is encountered the LLM will stop generating text and return

Environment variable: QUARKUS_LANGCHAIN4J_OLLAMA_CHAT_MODEL_STOP

list of string

Works together with top-k. A higher value (e.g., 0.95) will lead to more diverse text, while a lower value (e.g., 0.5) will generate more focused and conservative text

Environment variable: QUARKUS_LANGCHAIN4J_OLLAMA_CHAT_MODEL_TOP_P

double

0.9

Reduces the probability of generating nonsense. A higher value (e.g. 100) will give more diverse answers, while a lower value (e.g. 10) will be more conservative

Environment variable: QUARKUS_LANGCHAIN4J_OLLAMA_CHAT_MODEL_TOP_K

int

40

With a static number the result is always the same. With a random number the result varies Example:

`Random random = new Random();
int x = random.nextInt(Integer.MAX_VALUE);`

Environment variable: QUARKUS_LANGCHAIN4J_OLLAMA_CHAT_MODEL_SEED

int

The format to return a response in. Format can be json or a JSON schema.

Environment variable: QUARKUS_LANGCHAIN4J_OLLAMA_CHAT_MODEL_FORMAT

string

Whether chat model requests should be logged

Environment variable: QUARKUS_LANGCHAIN4J_OLLAMA_CHAT_MODEL_LOG_REQUESTS

boolean

false

Whether chat model responses should be logged

Environment variable: QUARKUS_LANGCHAIN4J_OLLAMA_CHAT_MODEL_LOG_RESPONSES

boolean

false

The temperature of the model. Increasing the temperature will make the model answer with more variability. A lower temperature will make the model answer more conservatively.

Environment variable: QUARKUS_LANGCHAIN4J_OLLAMA_EMBEDDING_MODEL_TEMPERATURE

double

${quarkus.langchain4j.temperature:0.8}

Maximum number of tokens to predict when generating text

Environment variable: QUARKUS_LANGCHAIN4J_OLLAMA_EMBEDDING_MODEL_NUM_PREDICT

int

128

Sets the stop sequences to use. When this pattern is encountered the LLM will stop generating text and return

Environment variable: QUARKUS_LANGCHAIN4J_OLLAMA_EMBEDDING_MODEL_STOP

list of string

Works together with top-k. A higher value (e.g., 0.95) will lead to more diverse text, while a lower value (e.g., 0.5) will generate more focused and conservative text

Environment variable: QUARKUS_LANGCHAIN4J_OLLAMA_EMBEDDING_MODEL_TOP_P

double

0.9

Reduces the probability of generating nonsense. A higher value (e.g. 100) will give more diverse answers, while a lower value (e.g. 10) will be more conservative

Environment variable: QUARKUS_LANGCHAIN4J_OLLAMA_EMBEDDING_MODEL_TOP_K

int

40

Whether embedding model requests should be logged

Environment variable: QUARKUS_LANGCHAIN4J_OLLAMA_EMBEDDING_MODEL_LOG_REQUESTS

boolean

false

Whether embedding model responses should be logged

Environment variable: QUARKUS_LANGCHAIN4J_OLLAMA_EMBEDDING_MODEL_LOG_RESPONSES

boolean

false

Named model config

Type

Default

Model to use

Environment variable: QUARKUS_LANGCHAIN4J_OLLAMA__MODEL_NAME__CHAT_MODEL_MODEL_ID

string

llama3.2

Model to use. According to Ollama docs, the default value is nomic-embed-text

Environment variable: QUARKUS_LANGCHAIN4J_OLLAMA__MODEL_NAME__EMBEDDING_MODEL_MODEL_ID

string

nomic-embed-text

Base URL where the Ollama serving is running

Environment variable: QUARKUS_LANGCHAIN4J_OLLAMA__MODEL_NAME__BASE_URL

string

If set, the named TLS configuration with the configured name will be applied to the REST Client

Environment variable: QUARKUS_LANGCHAIN4J_OLLAMA__MODEL_NAME__TLS_CONFIGURATION_NAME

string

Timeout for Ollama calls

Environment variable: QUARKUS_LANGCHAIN4J_OLLAMA__MODEL_NAME__TIMEOUT

Duration 

10s

Whether the Ollama client should log requests

Environment variable: QUARKUS_LANGCHAIN4J_OLLAMA__MODEL_NAME__LOG_REQUESTS

boolean

false

Whether the Ollama client should log responses

Environment variable: QUARKUS_LANGCHAIN4J_OLLAMA__MODEL_NAME__LOG_RESPONSES

boolean

false

Whether to enable the integration. Defaults to true, which means requests are made to the OpenAI provider. Set to false to disable all requests.

Environment variable: QUARKUS_LANGCHAIN4J_OLLAMA__MODEL_NAME__ENABLE_INTEGRATION

boolean

true

The temperature of the model. Increasing the temperature will make the model answer with more variability. A lower temperature will make the model answer more conservatively.

Environment variable: QUARKUS_LANGCHAIN4J_OLLAMA__MODEL_NAME__CHAT_MODEL_TEMPERATURE

double

${quarkus.langchain4j.temperature:0.8}

Maximum number of tokens to predict when generating text

Environment variable: QUARKUS_LANGCHAIN4J_OLLAMA__MODEL_NAME__CHAT_MODEL_NUM_PREDICT

int

Sets the stop sequences to use. When this pattern is encountered the LLM will stop generating text and return

Environment variable: QUARKUS_LANGCHAIN4J_OLLAMA__MODEL_NAME__CHAT_MODEL_STOP

list of string

Works together with top-k. A higher value (e.g., 0.95) will lead to more diverse text, while a lower value (e.g., 0.5) will generate more focused and conservative text

Environment variable: QUARKUS_LANGCHAIN4J_OLLAMA__MODEL_NAME__CHAT_MODEL_TOP_P

double

0.9

Reduces the probability of generating nonsense. A higher value (e.g. 100) will give more diverse answers, while a lower value (e.g. 10) will be more conservative

Environment variable: QUARKUS_LANGCHAIN4J_OLLAMA__MODEL_NAME__CHAT_MODEL_TOP_K

int

40

With a static number the result is always the same. With a random number the result varies Example:

`Random random = new Random();
int x = random.nextInt(Integer.MAX_VALUE);`

Environment variable: QUARKUS_LANGCHAIN4J_OLLAMA__MODEL_NAME__CHAT_MODEL_SEED

int

The format to return a response in. Format can be json or a JSON schema.

Environment variable: QUARKUS_LANGCHAIN4J_OLLAMA__MODEL_NAME__CHAT_MODEL_FORMAT

string

Whether chat model requests should be logged

Environment variable: QUARKUS_LANGCHAIN4J_OLLAMA__MODEL_NAME__CHAT_MODEL_LOG_REQUESTS

boolean

false

Whether chat model responses should be logged

Environment variable: QUARKUS_LANGCHAIN4J_OLLAMA__MODEL_NAME__CHAT_MODEL_LOG_RESPONSES

boolean

false

The temperature of the model. Increasing the temperature will make the model answer with more variability. A lower temperature will make the model answer more conservatively.

Environment variable: QUARKUS_LANGCHAIN4J_OLLAMA__MODEL_NAME__EMBEDDING_MODEL_TEMPERATURE

double

${quarkus.langchain4j.temperature:0.8}

Maximum number of tokens to predict when generating text

Environment variable: QUARKUS_LANGCHAIN4J_OLLAMA__MODEL_NAME__EMBEDDING_MODEL_NUM_PREDICT

int

128

Sets the stop sequences to use. When this pattern is encountered the LLM will stop generating text and return

Environment variable: QUARKUS_LANGCHAIN4J_OLLAMA__MODEL_NAME__EMBEDDING_MODEL_STOP

list of string

Works together with top-k. A higher value (e.g., 0.95) will lead to more diverse text, while a lower value (e.g., 0.5) will generate more focused and conservative text

Environment variable: QUARKUS_LANGCHAIN4J_OLLAMA__MODEL_NAME__EMBEDDING_MODEL_TOP_P

double

0.9

Reduces the probability of generating nonsense. A higher value (e.g. 100) will give more diverse answers, while a lower value (e.g. 10) will be more conservative

Environment variable: QUARKUS_LANGCHAIN4J_OLLAMA__MODEL_NAME__EMBEDDING_MODEL_TOP_K

int

40

Whether embedding model requests should be logged

Environment variable: QUARKUS_LANGCHAIN4J_OLLAMA__MODEL_NAME__EMBEDDING_MODEL_LOG_REQUESTS

boolean

false

Whether embedding model responses should be logged

Environment variable: QUARKUS_LANGCHAIN4J_OLLAMA__MODEL_NAME__EMBEDDING_MODEL_LOG_RESPONSES

boolean

false

About the Duration format

To write duration values, use the standard java.time.Duration format. See the Duration#parse() Java API documentation for more information.

You can also use a simplified format, starting with a number:

  • If the value is only a number, it represents time in seconds.

  • If the value is a number followed by ms, it represents time in milliseconds.

In other cases, the simplified format is translated to the java.time.Duration format for parsing:

  • If the value is a number followed by h, m, or s, it is prefixed with PT.

  • If the value is a number followed by d, it is prefixed with P.