Llama3.java Chat Models

Llama3.java enables running Large Language Models (LLMs) locally and purely in Java, embedded in your Quarkus application. It supports a growing collection of models available on Hugging Face under https://huggingface.co/mukel, such as Llama3 and Mistral variants.

Prerequisites

Java Version and Vector API

Llama3.java requires Java 21 or later due to its use of the Java Vector API for high-performance inference.

Since the Vector API is still an incubating feature (shipped in the jdk.incubator.vector module as of Java 21–23), you must enable it explicitly with the following JVM flags:

--enable-preview --enable-native-access=ALL-UNNAMED --add-modules jdk.incubator.vector
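For example, when running the packaged application directly (a sketch; the jar path assumes the default fast-jar output layout):

java --enable-preview --enable-native-access=ALL-UNNAMED --add-modules jdk.incubator.vector -jar target/quarkus-app/quarkus-run.jar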

Dev Mode Support

When using Dev Mode, the extension:

  • Automatically pulls and configures the selected model.

  • Ensures the C2 JIT compiler is enabled for optimal runtime performance.

  • Allows you to configure the model directory via:

quarkus.langchain4j.llama3.models-path=/your/custom/location

Model files are large (e.g., Llama3 models can exceed several GB) and may take time to download.

Native Mode Support

Llama3.java is compatible with GraalVM native mode, but only with Early Access versions of Oracle GraalVM 24.

For best native performance, add the following flags:

quarkus.native.additional-build-args=-O3,-march=native
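A native build could then be invoked as follows (a sketch; -Dnative assumes the standard native profile generated by Quarkus project scaffolding):

mvn package -Dnative -Dquarkus.native.additional-build-args=-O3,-march=native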

Using Llama3.java

To integrate the Llama3.java chat model into your Quarkus application, add:

<dependency>
    <groupId>io.quarkiverse.langchain4j</groupId>
    <artifactId>quarkus-langchain4j-llama3-java</artifactId>
    <version>1.0.2</version>
</dependency>

If no other LLM extension is installed, AI Services will automatically use the configured Llama3.java chat model.
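To illustrate, a minimal AI Service could look like this (a sketch; the Assistant interface and chat method are hypothetical names, not part of the extension):

import io.quarkiverse.langchain4j.RegisterAiService;

// Hypothetical AI Service; with only this extension installed,
// calls are served by the configured Llama3.java chat model.
@RegisterAiService
public interface Assistant {
    String chat(String question);
}

The interface can then be injected with @Inject like any other CDI bean.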

Chat Model Configuration

By default, the extension uses the mukel/Llama-3.2-1B-Instruct-GGUF model with Q4_0 quantization (see the configuration reference below).

To configure a different model, update the following property:

quarkus.langchain4j.llama3.chat-model.model-name=mukel/Llama-3.2-3B-Instruct-GGUF
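The quantization, temperature, and token limit can be tuned the same way (a sketch using the default values from the configuration reference below; property names are derived from the environment variable names listed there):

quarkus.langchain4j.llama3.chat-model.quantization=Q4_0
quarkus.langchain4j.llama3.chat-model.temperature=0.1
quarkus.langchain4j.llama3.chat-model.max-tokens=512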

Configuration Reference

Configuration property fixed at build time; all other configuration properties are overridable at runtime.

quarkus.langchain4j.llama3.include-models-in-artifact
Determines whether the necessary Llama3.java models are downloaded and included in the jar at build time. Currently, this option is only valid for fast-jar deployments.
Environment variable: QUARKUS_LANGCHAIN4J_LLAMA3_INCLUDE_MODELS_IN_ARTIFACT
Type: boolean. Default: true

quarkus.langchain4j.llama3.chat-model.enabled
Whether the model should be enabled.
Environment variable: QUARKUS_LANGCHAIN4J_LLAMA3_CHAT_MODEL_ENABLED
Type: boolean. Default: true

quarkus.langchain4j.llama3.chat-model.model-name
Model name to use.
Environment variable: QUARKUS_LANGCHAIN4J_LLAMA3_CHAT_MODEL_MODEL_NAME
Type: string. Default: mukel/Llama-3.2-1B-Instruct-GGUF

quarkus.langchain4j.llama3.chat-model.quantization
Quantization of the model to use.
Environment variable: QUARKUS_LANGCHAIN4J_LLAMA3_CHAT_MODEL_QUANTIZATION
Type: string. Default: Q4_0

quarkus.langchain4j.llama3.chat-model.pre-load-in-native
Llama3.java supports AOT model preloading, enabling zero-overhead, instant inference with minimal time-to-first-token (TTFT). A specialized, larger binary is generated with no parsing overhead for that particular model. It can still run other models, although those incur the usual parsing overhead. (An example follows this table.)
Environment variable: QUARKUS_LANGCHAIN4J_LLAMA3_CHAT_MODEL_PRE_LOAD_IN_NATIVE
Type: boolean. Default: false

quarkus.langchain4j.llama3.models-path
Location on the file system that serves as a cache for the models.
Environment variable: QUARKUS_LANGCHAIN4J_LLAMA3_MODELS_PATH
Type: path. Default: ${user.home}/.langchain4j/models

quarkus.langchain4j.llama3.chat-model.temperature
Temperature, in [0, inf).
Environment variable: QUARKUS_LANGCHAIN4J_LLAMA3_CHAT_MODEL_TEMPERATURE
Type: double. Default: 0.1

quarkus.langchain4j.llama3.chat-model.max-tokens
Maximum number of tokens to generate; a value < 0 means the limit is the model's context length.
Environment variable: QUARKUS_LANGCHAIN4J_LLAMA3_CHAT_MODEL_MAX_TOKENS
Type: int. Default: 512

quarkus.langchain4j.llama3.enable-integration
Whether to enable the integration. Set to false to disable all requests.
Environment variable: QUARKUS_LANGCHAIN4J_LLAMA3_ENABLE_INTEGRATION
Type: boolean. Default: true

quarkus.langchain4j.llama3.log-requests
Whether Llama3.java should log requests.
Environment variable: QUARKUS_LANGCHAIN4J_LLAMA3_LOG_REQUESTS
Type: boolean. Default: false

quarkus.langchain4j.llama3.log-responses
Whether Llama3.java should log responses.
Environment variable: QUARKUS_LANGCHAIN4J_LLAMA3_LOG_RESPONSES
Type: boolean. Default: false
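For example, to bake the default model into the native binary via AOT preloading and point the model cache at a custom directory (a sketch; /opt/models is an arbitrary example path):

quarkus.langchain4j.llama3.chat-model.pre-load-in-native=true
quarkus.langchain4j.llama3.models-path=/opt/models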

Named model config (replace model-name in each property with the name of your model configuration):

quarkus.langchain4j.llama3."model-name".chat-model.model-name
Model name to use.
Environment variable: QUARKUS_LANGCHAIN4J_LLAMA3__MODEL_NAME__CHAT_MODEL_MODEL_NAME
Type: string. Default: mukel/Llama-3.2-1B-Instruct-GGUF

quarkus.langchain4j.llama3."model-name".chat-model.quantization
Quantization of the model to use.
Environment variable: QUARKUS_LANGCHAIN4J_LLAMA3__MODEL_NAME__CHAT_MODEL_QUANTIZATION
Type: string. Default: Q4_0

quarkus.langchain4j.llama3."model-name".chat-model.pre-load-in-native
Llama3.java supports AOT model preloading, enabling zero-overhead, instant inference with minimal time-to-first-token (TTFT). A specialized, larger binary is generated with no parsing overhead for that particular model. It can still run other models, although those incur the usual parsing overhead.
Environment variable: QUARKUS_LANGCHAIN4J_LLAMA3__MODEL_NAME__CHAT_MODEL_PRE_LOAD_IN_NATIVE
Type: boolean. Default: false

quarkus.langchain4j.llama3."model-name".chat-model.temperature
Temperature, in [0, inf).
Environment variable: QUARKUS_LANGCHAIN4J_LLAMA3__MODEL_NAME__CHAT_MODEL_TEMPERATURE
Type: double. Default: 0.1

quarkus.langchain4j.llama3."model-name".chat-model.max-tokens
Maximum number of tokens to generate; a value < 0 means the limit is the model's context length.
Environment variable: QUARKUS_LANGCHAIN4J_LLAMA3__MODEL_NAME__CHAT_MODEL_MAX_TOKENS
Type: int. Default: 512

quarkus.langchain4j.llama3."model-name".enable-integration
Whether to enable the integration. Set to false to disable all requests.
Environment variable: QUARKUS_LANGCHAIN4J_LLAMA3__MODEL_NAME__ENABLE_INTEGRATION
Type: boolean. Default: true

quarkus.langchain4j.llama3."model-name".log-requests
Whether Llama3.java should log requests.
Environment variable: QUARKUS_LANGCHAIN4J_LLAMA3__MODEL_NAME__LOG_REQUESTS
Type: boolean. Default: false

quarkus.langchain4j.llama3."model-name".log-responses
Whether Llama3.java should log responses.
Environment variable: QUARKUS_LANGCHAIN4J_LLAMA3__MODEL_NAME__LOG_RESPONSES
Type: boolean. Default: false
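As an illustration, a named model configuration can be paired with an AI Service through the modelName attribute of @RegisterAiService (a sketch; the summarizer name, Summarizer interface, and model choice are hypothetical):

quarkus.langchain4j.llama3.summarizer.chat-model.model-name=mukel/Llama-3.2-3B-Instruct-GGUF

import io.quarkiverse.langchain4j.RegisterAiService;

// Hypothetical service bound to the "summarizer" named model configuration above.
@RegisterAiService(modelName = "summarizer")
public interface Summarizer {
    String summarize(String text);
}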