Jlama Chat Models
Jlama provides a way to run Large Language Models (LLMs) locally and in pure Java, embedded within your Quarkus application. It supports a growing set of models available on Hugging Face: https://huggingface.co/tjake.
Prerequisites
Java Version and Vector API
Jlama requires Java 21 or later because it leverages the Java Vector API for efficient inference. Since these APIs are still incubating/preview features, they must be enabled explicitly at runtime:
--enable-preview --enable-native-access=ALL-UNNAMED --add-modules jdk.incubator.vector
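For example, when running the packaged application, these flags can be passed directly on the java command line. The jar path below assumes the default fast-jar packaging; adjust it to your build output:
java --enable-preview --enable-native-access=ALL-UNNAMED --add-modules jdk.incubator.vector -jar target/quarkus-app/quarkus-run.jar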
Dev Mode Support
When using Dev Mode:
- The extension automatically pulls the configured model.
- JVM flags are set up automatically to enable the C2 compiler, which is required for proper inference performance.
- Disk space is required for downloaded models. The model directory can be customized via:
quarkus.langchain4j.jlama.models-path=/path/to/model/storage
Jlama models can be large (several GB) and may take time to download and initialize.
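For example, simply starting Dev Mode is enough to trigger the model download on first use:
./mvnw quarkus:dev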
Using Jlama
To integrate Jlama into your Quarkus project, add the following dependency:
<dependency>
    <groupId>io.quarkiverse.langchain4j</groupId>
    <artifactId>quarkus-langchain4j-jlama</artifactId>
    <version>1.0.2</version>
</dependency>
If no other LLM extension is installed, AI Services will automatically use the configured Jlama chat model.
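As an illustration, a minimal AI Service backed by the Jlama chat model could look like the sketch below. The Assistant interface, its method name, and the prompt text are placeholders; the annotations come from quarkus-langchain4j and LangChain4j.
import dev.langchain4j.service.SystemMessage;
import io.quarkiverse.langchain4j.RegisterAiService;

@RegisterAiService
public interface Assistant {

    // The single String parameter is used as the user message;
    // the response is generated locally by the configured Jlama model.
    @SystemMessage("You are a concise, helpful assistant.")
    String chat(String question);
}
Injecting Assistant into any CDI bean and calling chat("...") sends the prompt to the locally running Jlama model.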
Chat Model Configuration
By default, Jlama uses the TinyLlama-1.1B-Chat-v1.0-Jlama-Q4 model:
quarkus.langchain4j.jlama.chat-model.model-name=tjake/TinyLlama-1.1B-Chat-v1.0-Jlama-Q4
To switch to another model, such as Granite, update the model name:
quarkus.langchain4j.jlama.chat-model.model-name=tjake/granite-3.0-2b-instruct-JQ4
Configuration Reference
Configuration properties fixed at build time (marked below) cannot be changed afterwards; all other configuration properties are overridable at runtime. Each property can also be set through its corresponding environment variable, following the standard Quarkus naming convention (for example, QUARKUS_LANGCHAIN4J_JLAMA_CHAT_MODEL_MODEL_NAME maps to quarkus.langchain4j.jlama.chat-model.model-name).

Default model configuration:

| Configuration property | Description | Type | Default |
|---|---|---|---|
| quarkus.langchain4j.jlama.include-models-in-artifact (fixed at build time) | Determines whether the necessary Jlama models are downloaded and included in the jar at build time. Currently, this option is only valid for fast-jar deployments. | boolean | |
| quarkus.langchain4j.jlama.chat-model.enabled | Whether the chat model should be enabled | boolean | |
| quarkus.langchain4j.jlama.embedding-model.enabled | Whether the embedding model should be enabled | boolean | |
| quarkus.langchain4j.jlama.chat-model.model-name | Model name to use | string | tjake/TinyLlama-1.1B-Chat-v1.0-Jlama-Q4 |
| quarkus.langchain4j.jlama.embedding-model.model-name | Model name to use | string | |
| quarkus.langchain4j.jlama.models-path | Location on the file system which serves as a cache for the models | path | |
| quarkus.langchain4j.jlama.chat-model.temperature | What sampling temperature to use, between 0.0 and 1.0. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. It is generally recommended to set this or the top-p property, but not both. | double | |
| quarkus.langchain4j.jlama.chat-model.max-tokens | The maximum number of tokens to generate in the completion. The token count of your prompt plus max-tokens cannot exceed the model's context length. | int | |
| quarkus.langchain4j.jlama.enable-integration | Whether to enable the integration. Set to false to disable all requests. | boolean | |
| quarkus.langchain4j.jlama.log-requests | Whether Jlama should log requests | boolean | |
| quarkus.langchain4j.jlama.log-responses | Whether Jlama should log responses | boolean | |

Named model configuration (properties of the form quarkus.langchain4j.jlama."model-name".*, used when additional named model configurations are registered):

| Configuration property | Description | Type | Default |
|---|---|---|---|
| quarkus.langchain4j.jlama."model-name".chat-model.model-name | Model name to use | string | |
| quarkus.langchain4j.jlama."model-name".embedding-model.model-name | Model name to use | string | |
| quarkus.langchain4j.jlama."model-name".chat-model.temperature | What sampling temperature to use, between 0.0 and 1.0. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. It is generally recommended to set this or the top-p property, but not both. | double | |
| quarkus.langchain4j.jlama."model-name".chat-model.max-tokens | The maximum number of tokens to generate in the completion. The token count of your prompt plus max-tokens cannot exceed the model's context length. | int | |
| quarkus.langchain4j.jlama."model-name".enable-integration | Whether to enable the integration. Set to false to disable all requests. | boolean | |
| quarkus.langchain4j.jlama."model-name".log-requests | Whether Jlama should log requests | boolean | |
| quarkus.langchain4j.jlama."model-name".log-responses | Whether Jlama should log responses | boolean | |
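As an illustrative sketch (the property names follow the reference above; the values are arbitrary examples), an application.properties tuning the chat model and enabling request/response logging could look like:
quarkus.langchain4j.jlama.chat-model.model-name=tjake/granite-3.0-2b-instruct-JQ4
quarkus.langchain4j.jlama.chat-model.temperature=0.2
quarkus.langchain4j.jlama.chat-model.max-tokens=512
quarkus.langchain4j.jlama.log-requests=true
quarkus.langchain4j.jlama.log-responses=true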