Llama3.java Chat Models
Llama3.java enables running Large Language Models (LLMs) locally and purely in Java, embedded in your Quarkus application. It supports a growing collection of models available on Hugging Face under https://huggingface.co/mukel, such as Llama3 and Mistral variants.
Prerequisites
Java Version and Vector API
Llama3.java requires Java 21 or later due to its use of the Java Vector API for high-performance inference.
Because the Vector API still ships as an incubator module (as of Java 21–23), it must be enabled explicitly, alongside preview features and native access:
--enable-preview --enable-native-access=ALL-UNNAMED --add-modules jdk.incubator.vector
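For example, a packaged application built with the default fast-jar format can be launched with the flags in place (a sketch; it assumes the standard Quarkus output layout):

java --enable-preview --enable-native-access=ALL-UNNAMED \
     --add-modules jdk.incubator.vector \
     -jar target/quarkus-app/quarkus-run.jar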
Dev Mode Support
When using Dev Mode, the extension:
- Automatically pulls and configures the selected model.
- Ensures the C2 JIT compiler is enabled for optimal runtime performance.
- Allows you to configure the model directory via:

quarkus.langchain4j.llama3.models-path=/your/custom/location
Model files are large (Llama3 models can run to several gigabytes) and may take time to download.
Using Llama3.java
To integrate the Llama3.java chat model into your Quarkus application, add the following dependency:

<dependency>
    <groupId>io.quarkiverse.langchain4j</groupId>
    <artifactId>quarkus-langchain4j-llama3-java</artifactId>
    <version>1.0.2</version>
</dependency>
If no other LLM extension is installed, AI Services will automatically use the configured Llama3.java chat model.
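For example, a minimal AI Service might look like this (a sketch: the Assistant interface and prompts are illustrative, while @RegisterAiService and the message annotations are the standard quarkus-langchain4j and LangChain4j types):

import dev.langchain4j.service.SystemMessage;
import dev.langchain4j.service.UserMessage;
import io.quarkiverse.langchain4j.RegisterAiService;

// Resolves to the Llama3.java chat model when no other LLM extension is installed
@RegisterAiService
public interface Assistant {

    @SystemMessage("You are a concise, helpful assistant.")
    String chat(@UserMessage String question);
}

Injecting Assistant into any CDI bean then gives you a synchronous chat method backed by the local model.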
Chat Model Configuration
The extension ships with a default model. To configure a different one, update the following property:
quarkus.langchain4j.llama3.chat-model.model-name=mukel/Llama-3.2-3B-Instruct-GGUF
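The configured model can also be used programmatically by injecting the chat model type (a sketch; it assumes the LangChain4j 1.x dev.langchain4j.model.chat.ChatModel interface — earlier LangChain4j versions name it ChatLanguageModel):

import dev.langchain4j.model.chat.ChatModel;
import jakarta.inject.Inject;
import jakarta.ws.rs.GET;
import jakarta.ws.rs.Path;

@Path("/hello-local")
public class HelloLocalResource {

    // Backed by the locally running Llama3.java model
    @Inject
    ChatModel chatModel;

    @GET
    public String hello() {
        return chatModel.chat("Say hello from a local LLM.");
    }
}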
Configuration Reference
Configuration properties fixed at build time cannot be overridden at runtime; all other configuration properties are overridable at runtime.

| Configuration property | Type | Default |
|---|---|---|
| Determines whether the necessary Llama3 models are downloaded and included in the jar at build time (fixed at build time). Currently, this option is only valid for… | boolean | |
| Whether the model should be enabled. | boolean | |
| Model name to use. | string | |
| Quantization of the model to use. | string | |
| Enables AOT model preloading for zero-overhead, instant inference with minimal TTFT (time-to-first-token). A specialized, larger binary is generated with no parsing overhead for that particular model; other models can still run, albeit with the usual parsing overhead. | boolean | |
| Location on the file system that serves as a cache for the models. | path | |
| Temperature, in [0, ∞). | double | |
| Number of steps to run for; < 0 means limited by the context length. | int | |
| Whether to enable the integration. Set to false to disable all requests. | boolean | |
| Whether the Llama3.java client should log requests. | boolean | |
| Whether the Llama3.java client should log responses. | boolean | |
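Putting a few of these knobs together, a hypothetical application.properties might look like the following (the model-name and models-path keys appear earlier in this guide; the quantization and temperature keys are assumptions inferred from the table above, so verify them against the generated config reference for your version):

# Select the model to pull and run (key shown earlier in this guide)
quarkus.langchain4j.llama3.chat-model.model-name=mukel/Llama-3.2-3B-Instruct-GGUF
# Reuse a shared on-disk model cache (key shown earlier in this guide)
quarkus.langchain4j.llama3.models-path=/your/custom/location
# Assumed keys for quantization and sampling temperature -- verify before use
quarkus.langchain4j.llama3.chat-model.quantization=Q4_0
quarkus.langchain4j.llama3.chat-model.temperature=0.3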