Ollama
Prerequisites
To use Ollama, you need a running Ollama instance. Ollama is available for all major platforms and its installation is straightforward: simply visit the Ollama download page and follow the instructions.
Once installed, check that Ollama is running using:
> ollama --version
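Although the Dev Service described below can pull models automatically, you can also pull a model manually with the Ollama CLI, for example llama3.2:
> ollama pull llama3.2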
Dev Service
The Dev Service included with the Ollama extension handles most of the local setup for you.
The Dev Service automatically handles the pulling of the models configured by the application, so there is no need for users to do so manually.
Additionally, if you aren’t already running a local Ollama instance (either via the desktop client or a local container image), it will start an Ollama container on a random port and bind it to your application by setting quarkus.langchain4j.ollama.*.base-url to the URL where Ollama is running.
The container will also share downloaded models with any local client, so a model only needs to be downloaded the first time, regardless of whether you use the local Ollama client or the container provided by the Dev Service.
If the Dev Service starts an Ollama container, it will expose the following configuration properties that you can use within your own configuration should you need to:
langchain4j-ollama-dev-service.ollama.host=host (1)
langchain4j-ollama-dev-service.ollama.port=port (2)
langchain4j-ollama-dev-service.ollama.endpoint=http://${langchain4j-ollama-dev-service.ollama.host}:${langchain4j-ollama-dev-service.ollama.port} (3)
1. The host that the container is running on. Typically localhost, but it could be the name of the container network.
2. The port that the Ollama container is running on.
3. The fully-qualified URL (host + port) of the running Ollama container.
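For example, you could reuse the exposed endpoint in your own application.properties (the my-app.ollama-endpoint key below is just an illustrative name, not a property defined by the extension):
my-app.ollama-endpoint=${langchain4j-ollama-dev-service.ollama.endpoint}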
Models are huge. For example, Llama3 is 4.7 GB, so make sure you have enough disk space.
Due to their large size, pulling models can take time.
Using Ollama
To integrate with models running on Ollama, add the following dependency into your project:
<dependency>
<groupId>io.quarkiverse.langchain4j</groupId>
<artifactId>quarkus-langchain4j-ollama</artifactId>
<version>0.23.0.CR1</version>
</dependency>
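If you use Gradle rather than Maven, the equivalent dependency declaration is:
implementation("io.quarkiverse.langchain4j:quarkus-langchain4j-ollama:0.23.0.CR1")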
If no other LLM extension is installed, AI Services will automatically utilize the configured Ollama model.
By default, the extension uses the llama3.2 model. You can change it by setting the quarkus.langchain4j.ollama.chat-model.model-id property in the application.properties file:
quarkus.langchain4j.ollama.chat-model.model-id=mistral
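As a minimal sketch, an AI Service backed by the configured Ollama chat model could look like the following (the Assistant interface and its chat method are illustrative names, not part of the extension):
import dev.langchain4j.service.UserMessage;
import io.quarkiverse.langchain4j.RegisterAiService;
@RegisterAiService
public interface Assistant {

    // The user message is sent to the configured Ollama chat model
    String chat(@UserMessage String question);
}
You can then inject Assistant into any CDI bean and call chat(...).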
Configuration
Several configuration properties are available:
Configuration property fixed at build time - All other configuration properties are overridable at runtime
Configuration property | Type | Default
---|---|---
Whether the model should be enabled | boolean |
Whether the model should be enabled | boolean |
If Dev Services for Ollama has been explicitly enabled or disabled. Dev Services are generally enabled by default, unless there is an existing configuration present. | boolean |
The Ollama container image to use. | string |
Model to use | string |
Model to use. According to Ollama docs, the default value is | string |
Base URL where the Ollama serving is running | string |
If set, the named TLS configuration with the configured name will be applied to the REST Client | string |
Timeout for Ollama calls | |
Whether the Ollama client should log requests | boolean |
Whether the Ollama client should log responses | boolean |
Whether to enable the integration. Defaults to | boolean |
The temperature of the model. Increasing the temperature will make the model answer with more variability. A lower temperature will make the model answer more conservatively. | double |
Maximum number of tokens to predict when generating text | int |
Sets the stop sequences to use. When this pattern is encountered the LLM will stop generating text and return | list of string |
Works together with top-k. A higher value (e.g., 0.95) will lead to more diverse text, while a lower value (e.g., 0.5) will generate more focused and conservative text | double |
Reduces the probability of generating nonsense. A higher value (e.g. 100) will give more diverse answers, while a lower value (e.g. 10) will be more conservative | int |
With a static number the result is always the same. With a random number the result varies | int |
The format to return a response in. Format can be | string |
Whether chat model requests should be logged | boolean |
Whether chat model responses should be logged | boolean |
The temperature of the model. Increasing the temperature will make the model answer with more variability. A lower temperature will make the model answer more conservatively. | double |
Maximum number of tokens to predict when generating text | int |
Sets the stop sequences to use. When this pattern is encountered the LLM will stop generating text and return | list of string |
Works together with top-k. A higher value (e.g., 0.95) will lead to more diverse text, while a lower value (e.g., 0.5) will generate more focused and conservative text | double |
Reduces the probability of generating nonsense. A higher value (e.g. 100) will give more diverse answers, while a lower value (e.g. 10) will be more conservative | int |
Whether embedding model requests should be logged | boolean |
Whether embedding model responses should be logged | boolean |
About the Duration format
To write duration values, use the standard java.time.Duration format. You can also use a simplified format, starting with a number: a plain number is interpreted as seconds, and a number followed by ms is interpreted as milliseconds. In other cases, the simplified format is translated to the java.time.Duration format for parsing.
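For example, assuming the Ollama call timeout is exposed as quarkus.langchain4j.ollama.timeout (the property name here is an assumption), a duration value can be written as:
quarkus.langchain4j.ollama.timeout=10s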
Document Retriever and Embedding
Ollama also provides embedding models. By default, it uses nomic-embed-text.
You can change the default embedding model by setting the quarkus.langchain4j.ollama.embedding-model.model-id property in the application.properties file:
quarkus.langchain4j.log-requests=true
quarkus.langchain4j.log-responses=true
quarkus.langchain4j.ollama.chat-model.model-id=mistral
quarkus.langchain4j.ollama.embedding-model.model-id=mistral
If no other LLM extension is installed, retrieve the embedding model as follows:
@Inject EmbeddingModel model; // Injects the embedding model
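As a minimal sketch of using the injected model (the EmbeddingExample class name is illustrative):
import dev.langchain4j.data.embedding.Embedding;
import dev.langchain4j.model.embedding.EmbeddingModel;
import jakarta.enterprise.context.ApplicationScoped;
import jakarta.inject.Inject;
@ApplicationScoped
public class EmbeddingExample {

    @Inject
    EmbeddingModel model; // backed by the configured Ollama embedding model

    public float[] embed(String text) {
        // embed() returns a Response<Embedding>; content() unwraps the embedding
        Embedding embedding = model.embed(text).content();
        return embedding.vector();
    }
}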
Dynamic Authorization Headers
There are cases where you may need to provide dynamic authorization headers to be passed to the Ollama endpoints.
There are two ways to achieve this:
Using a client request filter annotated with @Provider
.
As the underlying HTTP communication relies on the Quarkus REST Client, it is possible to apply a filter that will be called for all Ollama requests and set the headers accordingly.
import jakarta.enterprise.context.ApplicationScoped;
import jakarta.inject.Inject;
import jakarta.ws.rs.ext.Provider;
import org.jboss.resteasy.reactive.client.spi.ResteasyReactiveClientRequestContext;
import org.jboss.resteasy.reactive.client.spi.ResteasyReactiveClientRequestFilter;
@Provider
@ApplicationScoped
public class RequestFilter implements ResteasyReactiveClientRequestFilter {

    @Inject
    MyAuthorizationService myAuthorizationService;

    @Override
    public void filter(ResteasyReactiveClientRequestContext requestContext) {
        /*
         * All requests will be filtered here, therefore make sure that you make
         * the necessary checks to avoid putting the Authorization header in
         * requests that do not need it.
         */
        requestContext.getHeaders().putSingle("Authorization", ...);
    }
}
Using ModelAuthProvider
An even simpler approach consists of implementing the ModelAuthProvider
interface and providing an implementation of the getAuthorization
method.
This is useful when you need to provide different authorization headers for different models. The @ModelName
annotation can be used to specify the model name in this scenario.
import io.quarkiverse.langchain4j.ModelName;
import io.quarkiverse.langchain4j.auth.ModelAuthProvider;
import jakarta.enterprise.context.ApplicationScoped;
import jakarta.inject.Inject;
@ApplicationScoped
@ModelName("my-model-name") // you can omit this if you have only one model or if you want to use the default model
public class TestClass implements ModelAuthProvider {

    @Inject
    MyTokenProviderService tokenProviderService;

    @Override
    public String getAuthorization(Input input) {
        /*
         * The `input` will contain some information about the request
         * about to be passed to the remote model endpoints
         */
        return "Bearer " + tokenProviderService.getToken();
    }
}