Ollama

Ollama provides a way to run large language models (LLMs) locally. You can run many models, such as Llama3, Mistral, and CodeLlama, on your machine, with full CPU and GPU support.

Prerequisites

To use Ollama, you need a running Ollama installation. Ollama is available for all major platforms and is easy to install: visit the Ollama download page and follow the instructions.

Once installed, check that Ollama is running using:

> ollama --version

Dev Service

The Dev Service included with the Ollama extension takes care of several things for you.

The Dev Service automatically handles the pulling of the models configured by the application, so there is no need for users to do so manually.

Additionally, if you aren’t already running a local Ollama instance (either via the desktop client or a local container image), the Dev Service will start an Ollama container on a random port and bind it to your application by setting quarkus.langchain4j.ollama.*.base-url to the URL where Ollama is running.

The container will also share downloaded models with any local client, so a model only needs to be downloaded the first time, regardless of whether you use the local Ollama client or the container provided by the Dev Service.
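If you prefer to point the extension at an Ollama instance you manage yourself, or to keep the Dev Service from starting a container at all, you can configure this explicitly. A minimal sketch in application.properties, assuming Ollama's default port and the property names listed in the configuration reference below:

# Use an existing Ollama instance instead of the Dev Service container
quarkus.langchain4j.ollama.base-url=http://localhost:11434

# Optionally disable the Ollama Dev Service entirely
quarkus.langchain4j.ollama.devservices.enabled=false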

If the Dev Service starts an Ollama container, it will expose the following configuration properties that you can use within your own configuration should you need to:

langchain4j-ollama-dev-service.ollama.host=host (1)
langchain4j-ollama-dev-service.ollama.port=port (2)
langchain4j-ollama-dev-service.ollama.endpoint=http://${langchain4j-ollama-dev-service.ollama.host}:${langchain4j-ollama-dev-service.ollama.port} (3)
  1. The host that the container is running on. Typically localhost, but it could be the name of the container network.

  2. The port that the Ollama container is running on.

  3. The fully-qualified url (host + port) to the running Ollama container.
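These can be referenced from your own configuration through standard property expansion. For example (the my-app.ollama-endpoint key is purely illustrative):

my-app.ollama-endpoint=${langchain4j-ollama-dev-service.ollama.endpoint}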

Models are huge. For example, Llama3 is 4.7 GB, so make sure you have enough disk space.
Due to their large size, pulling models can take time.

Using Ollama

To integrate with models running on Ollama, add the following dependency to your project:

<dependency>
    <groupId>io.quarkiverse.langchain4j</groupId>
    <artifactId>quarkus-langchain4j-ollama</artifactId>
    <version>0.23.0.CR1</version>
</dependency>

If no other LLM extension is installed, AI Services will automatically utilize the configured Ollama model.

By default, the extension uses the llama3.2 model (which the Dev Service pulls automatically if needed). You can change it by setting the quarkus.langchain4j.ollama.chat-model.model-id property in the application.properties file:

quarkus.langchain4j.ollama.chat-model.model-id=mistral
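Once the model is configured, it is used like any other chat model through an AI Service. A minimal sketch, assuming the standard quarkus-langchain4j AI Service annotations (the Assistant interface and the prompt text are illustrative only):

import dev.langchain4j.service.SystemMessage;
import dev.langchain4j.service.UserMessage;
import io.quarkiverse.langchain4j.RegisterAiService;

// Uses the configured Ollama chat model when no other LLM extension is installed
@RegisterAiService
public interface Assistant {

    @SystemMessage("You are a concise assistant.")
    String chat(@UserMessage String question);
}

Injecting Assistant into a bean and calling chat(...) sends the request to the configured Ollama model.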

Configuration

Several configuration properties are available:

Configuration properties fixed at build time cannot be overridden at runtime; all other configuration properties are overridable at runtime. Each property below is listed with its environment variable, its type, and its default value (if any).

Whether the chat model should be enabled
Environment variable: QUARKUS_LANGCHAIN4J_OLLAMA_CHAT_MODEL_ENABLED
Type: boolean. Default: true

Whether the embedding model should be enabled
Environment variable: QUARKUS_LANGCHAIN4J_OLLAMA_EMBEDDING_MODEL_ENABLED
Type: boolean. Default: true

Whether Dev Services for Ollama has been explicitly enabled or disabled. Dev Services are generally enabled by default, unless there is an existing configuration present.
Environment variable: QUARKUS_LANGCHAIN4J_OLLAMA_DEVSERVICES_ENABLED
Type: boolean. Default: true

The Ollama container image to use.
Environment variable: QUARKUS_LANGCHAIN4J_OLLAMA_DEVSERVICES_IMAGE_NAME
Type: string. Default: ollama/ollama:latest

Chat model to use.
Environment variable: QUARKUS_LANGCHAIN4J_OLLAMA_CHAT_MODEL_MODEL_ID
Type: string. Default: llama3.2

Embedding model to use. According to the Ollama docs, the default value is nomic-embed-text.
Environment variable: QUARKUS_LANGCHAIN4J_OLLAMA_EMBEDDING_MODEL_MODEL_ID
Type: string. Default: nomic-embed-text

Base URL where the Ollama server is running.
Environment variable: QUARKUS_LANGCHAIN4J_OLLAMA_BASE_URL
Type: string. No default.

If set, the named TLS configuration with the configured name will be applied to the REST Client.
Environment variable: QUARKUS_LANGCHAIN4J_OLLAMA_TLS_CONFIGURATION_NAME
Type: string. No default.

Timeout for Ollama calls.
Environment variable: QUARKUS_LANGCHAIN4J_OLLAMA_TIMEOUT
Type: Duration. Default: 10s

Whether the Ollama client should log requests.
Environment variable: QUARKUS_LANGCHAIN4J_OLLAMA_LOG_REQUESTS
Type: boolean. Default: false

Whether the Ollama client should log responses.
Environment variable: QUARKUS_LANGCHAIN4J_OLLAMA_LOG_RESPONSES
Type: boolean. Default: false

Whether to enable the integration. Defaults to true, which means requests are made to the Ollama provider. Set to false to disable all requests.
Environment variable: QUARKUS_LANGCHAIN4J_OLLAMA_ENABLE_INTEGRATION
Type: boolean. Default: true

The temperature of the model. Increasing the temperature will make the model answer with more variability; a lower temperature will make the model answer more conservatively.
Environment variable: QUARKUS_LANGCHAIN4J_OLLAMA_CHAT_MODEL_TEMPERATURE
Type: double. Default: ${quarkus.langchain4j.temperature:0.8}

Maximum number of tokens to predict when generating text.
Environment variable: QUARKUS_LANGCHAIN4J_OLLAMA_CHAT_MODEL_NUM_PREDICT
Type: int. No default.

Sets the stop sequences to use. When such a sequence is encountered, the LLM stops generating text and returns.
Environment variable: QUARKUS_LANGCHAIN4J_OLLAMA_CHAT_MODEL_STOP
Type: list of string. No default.

Works together with top-k. A higher value (e.g. 0.95) will lead to more diverse text, while a lower value (e.g. 0.5) will generate more focused and conservative text.
Environment variable: QUARKUS_LANGCHAIN4J_OLLAMA_CHAT_MODEL_TOP_P
Type: double. Default: 0.9

Reduces the probability of generating nonsense. A higher value (e.g. 100) will give more diverse answers, while a lower value (e.g. 10) will be more conservative.
Environment variable: QUARKUS_LANGCHAIN4J_OLLAMA_CHAT_MODEL_TOP_K
Type: int. Default: 40

The seed used for generation. With a static number the result is always the same; with a random number (for example, new Random().nextInt(Integer.MAX_VALUE)) the result varies.
Environment variable: QUARKUS_LANGCHAIN4J_OLLAMA_CHAT_MODEL_SEED
Type: int. No default.

The format to return a response in. Format can be json or a JSON schema.
Environment variable: QUARKUS_LANGCHAIN4J_OLLAMA_CHAT_MODEL_FORMAT
Type: string. No default.

Whether chat model requests should be logged.
Environment variable: QUARKUS_LANGCHAIN4J_OLLAMA_CHAT_MODEL_LOG_REQUESTS
Type: boolean. Default: false

Whether chat model responses should be logged.
Environment variable: QUARKUS_LANGCHAIN4J_OLLAMA_CHAT_MODEL_LOG_RESPONSES
Type: boolean. Default: false

The temperature of the model. Increasing the temperature will make the model answer with more variability; a lower temperature will make the model answer more conservatively.
Environment variable: QUARKUS_LANGCHAIN4J_OLLAMA_EMBEDDING_MODEL_TEMPERATURE
Type: double. Default: ${quarkus.langchain4j.temperature:0.8}

Maximum number of tokens to predict when generating text.
Environment variable: QUARKUS_LANGCHAIN4J_OLLAMA_EMBEDDING_MODEL_NUM_PREDICT
Type: int. Default: 128

Sets the stop sequences to use. When such a sequence is encountered, the LLM stops generating text and returns.
Environment variable: QUARKUS_LANGCHAIN4J_OLLAMA_EMBEDDING_MODEL_STOP
Type: list of string. No default.

Works together with top-k. A higher value (e.g. 0.95) will lead to more diverse text, while a lower value (e.g. 0.5) will generate more focused and conservative text.
Environment variable: QUARKUS_LANGCHAIN4J_OLLAMA_EMBEDDING_MODEL_TOP_P
Type: double. Default: 0.9

Reduces the probability of generating nonsense. A higher value (e.g. 100) will give more diverse answers, while a lower value (e.g. 10) will be more conservative.
Environment variable: QUARKUS_LANGCHAIN4J_OLLAMA_EMBEDDING_MODEL_TOP_K
Type: int. Default: 40

Whether embedding model requests should be logged.
Environment variable: QUARKUS_LANGCHAIN4J_OLLAMA_EMBEDDING_MODEL_LOG_REQUESTS
Type: boolean. Default: false

Whether embedding model responses should be logged.
Environment variable: QUARKUS_LANGCHAIN4J_OLLAMA_EMBEDDING_MODEL_LOG_RESPONSES
Type: boolean. Default: false

Named model config

The same properties are also available for named model configurations. In the environment variables below, __MODEL_NAME__ is a placeholder for the name of the model configuration.

Chat model to use.
Environment variable: QUARKUS_LANGCHAIN4J_OLLAMA__MODEL_NAME__CHAT_MODEL_MODEL_ID
Type: string. Default: llama3.2

Embedding model to use. According to the Ollama docs, the default value is nomic-embed-text.
Environment variable: QUARKUS_LANGCHAIN4J_OLLAMA__MODEL_NAME__EMBEDDING_MODEL_MODEL_ID
Type: string. Default: nomic-embed-text

Base URL where the Ollama server is running.
Environment variable: QUARKUS_LANGCHAIN4J_OLLAMA__MODEL_NAME__BASE_URL
Type: string. No default.

If set, the named TLS configuration with the configured name will be applied to the REST Client.
Environment variable: QUARKUS_LANGCHAIN4J_OLLAMA__MODEL_NAME__TLS_CONFIGURATION_NAME
Type: string. No default.

Timeout for Ollama calls.
Environment variable: QUARKUS_LANGCHAIN4J_OLLAMA__MODEL_NAME__TIMEOUT
Type: Duration. Default: 10s

Whether the Ollama client should log requests.
Environment variable: QUARKUS_LANGCHAIN4J_OLLAMA__MODEL_NAME__LOG_REQUESTS
Type: boolean. Default: false

Whether the Ollama client should log responses.
Environment variable: QUARKUS_LANGCHAIN4J_OLLAMA__MODEL_NAME__LOG_RESPONSES
Type: boolean. Default: false

Whether to enable the integration. Defaults to true, which means requests are made to the Ollama provider. Set to false to disable all requests.
Environment variable: QUARKUS_LANGCHAIN4J_OLLAMA__MODEL_NAME__ENABLE_INTEGRATION
Type: boolean. Default: true

The temperature of the model. Increasing the temperature will make the model answer with more variability; a lower temperature will make the model answer more conservatively.
Environment variable: QUARKUS_LANGCHAIN4J_OLLAMA__MODEL_NAME__CHAT_MODEL_TEMPERATURE
Type: double. Default: ${quarkus.langchain4j.temperature:0.8}

Maximum number of tokens to predict when generating text.
Environment variable: QUARKUS_LANGCHAIN4J_OLLAMA__MODEL_NAME__CHAT_MODEL_NUM_PREDICT
Type: int. No default.

Sets the stop sequences to use. When such a sequence is encountered, the LLM stops generating text and returns.
Environment variable: QUARKUS_LANGCHAIN4J_OLLAMA__MODEL_NAME__CHAT_MODEL_STOP
Type: list of string. No default.

Works together with top-k. A higher value (e.g. 0.95) will lead to more diverse text, while a lower value (e.g. 0.5) will generate more focused and conservative text.
Environment variable: QUARKUS_LANGCHAIN4J_OLLAMA__MODEL_NAME__CHAT_MODEL_TOP_P
Type: double. Default: 0.9

Reduces the probability of generating nonsense. A higher value (e.g. 100) will give more diverse answers, while a lower value (e.g. 10) will be more conservative.
Environment variable: QUARKUS_LANGCHAIN4J_OLLAMA__MODEL_NAME__CHAT_MODEL_TOP_K
Type: int. Default: 40

The seed used for generation. With a static number the result is always the same; with a random number (for example, new Random().nextInt(Integer.MAX_VALUE)) the result varies.
Environment variable: QUARKUS_LANGCHAIN4J_OLLAMA__MODEL_NAME__CHAT_MODEL_SEED
Type: int. No default.

The format to return a response in. Format can be json or a JSON schema.
Environment variable: QUARKUS_LANGCHAIN4J_OLLAMA__MODEL_NAME__CHAT_MODEL_FORMAT
Type: string. No default.

Whether chat model requests should be logged.
Environment variable: QUARKUS_LANGCHAIN4J_OLLAMA__MODEL_NAME__CHAT_MODEL_LOG_REQUESTS
Type: boolean. Default: false

Whether chat model responses should be logged.
Environment variable: QUARKUS_LANGCHAIN4J_OLLAMA__MODEL_NAME__CHAT_MODEL_LOG_RESPONSES
Type: boolean. Default: false

The temperature of the model. Increasing the temperature will make the model answer with more variability; a lower temperature will make the model answer more conservatively.
Environment variable: QUARKUS_LANGCHAIN4J_OLLAMA__MODEL_NAME__EMBEDDING_MODEL_TEMPERATURE
Type: double. Default: ${quarkus.langchain4j.temperature:0.8}

Maximum number of tokens to predict when generating text.
Environment variable: QUARKUS_LANGCHAIN4J_OLLAMA__MODEL_NAME__EMBEDDING_MODEL_NUM_PREDICT
Type: int. Default: 128

Sets the stop sequences to use. When such a sequence is encountered, the LLM stops generating text and returns.
Environment variable: QUARKUS_LANGCHAIN4J_OLLAMA__MODEL_NAME__EMBEDDING_MODEL_STOP
Type: list of string. No default.

Works together with top-k. A higher value (e.g. 0.95) will lead to more diverse text, while a lower value (e.g. 0.5) will generate more focused and conservative text.
Environment variable: QUARKUS_LANGCHAIN4J_OLLAMA__MODEL_NAME__EMBEDDING_MODEL_TOP_P
Type: double. Default: 0.9

Reduces the probability of generating nonsense. A higher value (e.g. 100) will give more diverse answers, while a lower value (e.g. 10) will be more conservative.
Environment variable: QUARKUS_LANGCHAIN4J_OLLAMA__MODEL_NAME__EMBEDDING_MODEL_TOP_K
Type: int. Default: 40

Whether embedding model requests should be logged.
Environment variable: QUARKUS_LANGCHAIN4J_OLLAMA__MODEL_NAME__EMBEDDING_MODEL_LOG_REQUESTS
Type: boolean. Default: false

Whether embedding model responses should be logged.
Environment variable: QUARKUS_LANGCHAIN4J_OLLAMA__MODEL_NAME__EMBEDDING_MODEL_LOG_RESPONSES
Type: boolean. Default: false

About the Duration format

To write duration values, use the standard java.time.Duration format. See the Duration#parse() Java API documentation for more information.

You can also use a simplified format, starting with a number:

  • If the value is only a number, it represents time in seconds.

  • If the value is a number followed by ms, it represents time in milliseconds.

In other cases, the simplified format is translated to the java.time.Duration format for parsing:

  • If the value is a number followed by h, m, or s, it is prefixed with PT.

  • If the value is a number followed by d, it is prefixed with P.
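For example, the Ollama call timeout (the quarkus.langchain4j.ollama.timeout property, assumed here from the configuration reference above) could be raised to thirty seconds:

quarkus.langchain4j.ollama.timeout=30s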

Document Retriever and Embedding

Ollama also provides embedding models. By default, it uses nomic-embed-text.

You can change the default embedding model by setting the quarkus.langchain4j.ollama.embedding-model.model-id property in the application.properties file:

quarkus.langchain4j.log-requests=true
quarkus.langchain4j.log-responses=true

quarkus.langchain4j.ollama.chat-model.model-id=mistral
quarkus.langchain4j.ollama.embedding-model.model-id=mistral

If no other LLM extension is installed, retrieve the embedding model as follows:

@Inject EmbeddingModel model; // Injects the embedding model
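A short sketch of using the injected model to embed a piece of text, based on the standard langchain4j EmbeddingModel API (the EmbeddingExample class is illustrative only):

import dev.langchain4j.data.embedding.Embedding;
import dev.langchain4j.model.embedding.EmbeddingModel;
import dev.langchain4j.model.output.Response;
import jakarta.enterprise.context.ApplicationScoped;
import jakarta.inject.Inject;

@ApplicationScoped
public class EmbeddingExample {

    @Inject
    EmbeddingModel embeddingModel; // backed by the configured Ollama embedding model

    public float[] embed(String text) {
        // embed() returns the vector produced by the embedding model
        Response<Embedding> response = embeddingModel.embed(text);
        return response.content().vector();
    }
}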

Dynamic Authorization Headers

There are cases where one may need to provide dynamic authorization headers to be passed to Ollama endpoints.

There are two ways to achieve this:

Using a client request filter annotated with @Provider.

As the underlying HTTP communication relies on the Quarkus REST Client, it is possible to register a filter that will be invoked for all Ollama requests and set the headers accordingly.

import jakarta.enterprise.context.ApplicationScoped;
import jakarta.inject.Inject;
import jakarta.ws.rs.ext.Provider;
import org.jboss.resteasy.reactive.client.spi.ResteasyReactiveClientRequestContext;
import org.jboss.resteasy.reactive.client.spi.ResteasyReactiveClientRequestFilter;

@Provider
@ApplicationScoped
public class RequestFilter implements ResteasyReactiveClientRequestFilter {

    @Inject
    MyAuthorizationService myAuthorizationService;

    @Override
    public void filter(ResteasyReactiveClientRequestContext requestContext) {
        /*
         * All requests will be filtered here, therefore make sure that you make
         * the necessary checks to avoid putting the Authorization header in
         * requests that do not need it.
         */
        requestContext.getHeaders().putSingle("Authorization", ...);
    }
}

Using AuthProvider

An even simpler approach consists of implementing the ModelAuthProvider interface and providing an implementation of the getAuthorization method.

This is useful when you need to provide different authorization headers for different Ollama models. The @ModelName annotation can be used to specify the model name in this scenario.

import io.quarkiverse.langchain4j.ModelName;
import io.quarkiverse.langchain4j.auth.ModelAuthProvider;
import jakarta.enterprise.context.ApplicationScoped;
import jakarta.inject.Inject;

@ApplicationScoped
@ModelName("my-model-name") //you can omit this if you have only one model or if you want to use the default model
public class TestClass implements ModelAuthProvider {
    @Inject MyTokenProviderService tokenProviderService;

    @Override
    public String getAuthorization(Input input) {
        /*
         * The `input` will contain some information about the request
         * about to be passed to the remote model endpoints
         */
        return "Bearer " + tokenProviderService.getToken();
    }
}

Tools

Tools are supported in Ollama since version 0.3.0. However, not all models available in Ollama support them; consult the Ollama model library for an up-to-date list of models that do.
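As an illustration, a tool can be declared as a CDI bean and wired to an AI Service in the usual quarkus-langchain4j way; whether it is actually invoked depends on the configured Ollama model supporting tools. A minimal sketch (the Calculator and MathAssistant types are illustrative only):

import dev.langchain4j.agent.tool.Tool;
import dev.langchain4j.service.UserMessage;
import io.quarkiverse.langchain4j.RegisterAiService;
import jakarta.enterprise.context.ApplicationScoped;

// A tool the model may decide to call during a conversation
@ApplicationScoped
class Calculator {

    @Tool("Adds two integers")
    int add(int a, int b) {
        return a + b;
    }
}

// AI Service with the tool registered; requests go to the configured Ollama chat model
@RegisterAiService(tools = Calculator.class)
interface MathAssistant {

    String answer(@UserMessage String question);
}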