Llama3.java

Llama3.java provides a way to run large language models (LLMs) locally and in pure Java and embedded in your Quarkus application. You can run various models such as LLama3, Mistral on your machine.

Prerequisites

To use Llama3.java it is necessary to run on Java 21 or later. This is because it utilizes the new Vector API for faster inference. Note that the Vector API is still a Java preview features, so it is required to explicitly enable it.

Since the Vector API are still a preview feature in Java 21, and up to the latest Java 23, it is necessary to enable it on the JVM by launching it with the following flags:

--enable-preview --enable-native-access=ALL-UNNAMED --add-modules jdk.incubator.vector

Dev Mode

Quarkus LangChain4j automatically handles the pulling of the models configured by the application, so there is no need for users to do so manually. Furthermore, the extension properly configures the launch of Java process in order to ensure that the C2 compiler will be enabled (as without it, Llama3.java is virtually unusable).

Models are generally very large and can take time to download while also consuming a large chunk of disk space. Models location can be controlled using quarkus.langchain4j.llama3.models-path property.

Native mode

Currently, Llama3.java only works in native mode with Early Access version’s of Oracle GraalVM 24 (which can be easily downloaded with SDKMan).

To achieve the best performance in native mode, it is suggested to configure the application with the following:

quarkus.native.additional-build-args=-O3,-march=native

Using Llama3.java

To let Llama3.java running inference on your models, add the following dependency into your project:

<dependency>
    <groupId>io.quarkiverse.langchain4j</groupId>
    <artifactId>quarkus-langchain4j-llama3-java</artifactId>
    <version>1.0.1</version>
</dependency>

If no other LLM extension is installed, AI Services will automatically utilize the configured Llama3.java model.

By default, the extension uses as model mukel/Llama-3.2-1B-Instruct-GGUF. You can change it by setting the quarkus.langchain4j.llama3.chat-model.model-name property in the application.properties file:

quarkus.langchain4j.llama3.chat-model.model-name=mukel/Llama-3.2-3B-Instruct-GGUF

Configuration

Several configuration properties are available:

Configuration property fixed at build time - All other configuration properties are overridable at runtime

Configuration property	Type	Default
`quarkus.langchain4j.llama3.include-models-in-artifact` Determines whether the necessary Jlama models are downloaded and included in the jar at build time. Currently, this option is only valid for `fast-jar` deployments. Environment variable: `QUARKUS_LANGCHAIN4J_LLAMA3_INCLUDE_MODELS_IN_ARTIFACT`	boolean	`true`
`quarkus.langchain4j.llama3.chat-model.enabled` Whether the model should be enabled Environment variable: `QUARKUS_LANGCHAIN4J_LLAMA3_CHAT_MODEL_ENABLED`	boolean	`true`
`quarkus.langchain4j.llama3.chat-model.model-name` Model name to use Environment variable: `QUARKUS_LANGCHAIN4J_LLAMA3_CHAT_MODEL_MODEL_NAME`	string	`mukel/Llama-3.2-1B-Instruct-GGUF`
`quarkus.langchain4j.llama3.chat-model.quantization` Quantization of the model to use Environment variable: `QUARKUS_LANGCHAIN4J_LLAMA3_CHAT_MODEL_QUANTIZATION`	string	`Q4_0`
`quarkus.langchain4j.llama3.chat-model.pre-load-in-native` Llama3.java supports AOT model preloading, enabling 0-overhead, instant inference, with minimal TTFT (time-to-first-token). A specialized, larger binary will be generated, with no parsing overhead for that particular model. It can still run other models, although incurring the usual parsing overhead. Environment variable: `QUARKUS_LANGCHAIN4J_LLAMA3_CHAT_MODEL_PRE_LOAD_IN_NATIVE`	boolean	`false`
`quarkus.langchain4j.llama3.models-path` Location on the file-system which serves as a cache for the models Environment variable: `QUARKUS_LANGCHAIN4J_LLAMA3_MODELS_PATH`	path	`${user.home}/.langchain4j/models`
`quarkus.langchain4j.llama3.chat-model.temperature` Temperature in [0,inf] Environment variable: `QUARKUS_LANGCHAIN4J_LLAMA3_CHAT_MODEL_TEMPERATURE`	double	`0.1`
`quarkus.langchain4j.llama3.chat-model.max-tokens` Number of steps to run for < 0 = limited by context length Environment variable: `QUARKUS_LANGCHAIN4J_LLAMA3_CHAT_MODEL_MAX_TOKENS`	int	`512`
`quarkus.langchain4j.llama3.enable-integration` Whether to enable the integration. Set to `false` to disable all requests. Environment variable: `QUARKUS_LANGCHAIN4J_LLAMA3_ENABLE_INTEGRATION`	boolean	`true`
`quarkus.langchain4j.llama3.log-requests` Whether Jlama should log requests Environment variable: `QUARKUS_LANGCHAIN4J_LLAMA3_LOG_REQUESTS`	boolean	`false`
`quarkus.langchain4j.llama3.log-responses` Whether Jlama client should log responses Environment variable: `QUARKUS_LANGCHAIN4J_LLAMA3_LOG_RESPONSES`	boolean	`false`
Named model config	Type	Default
`quarkus.langchain4j.llama3."model-name".chat-model.model-name` Model name to use Environment variable: `QUARKUS_LANGCHAIN4J_LLAMA3__MODEL_NAME__CHAT_MODEL_MODEL_NAME`	string	`mukel/Llama-3.2-1B-Instruct-GGUF`
`quarkus.langchain4j.llama3."model-name".chat-model.quantization` Quantization of the model to use Environment variable: `QUARKUS_LANGCHAIN4J_LLAMA3__MODEL_NAME__CHAT_MODEL_QUANTIZATION`	string	`Q4_0`
`quarkus.langchain4j.llama3."model-name".chat-model.pre-load-in-native` Llama3.java supports AOT model preloading, enabling 0-overhead, instant inference, with minimal TTFT (time-to-first-token). A specialized, larger binary will be generated, with no parsing overhead for that particular model. It can still run other models, although incurring the usual parsing overhead. Environment variable: `QUARKUS_LANGCHAIN4J_LLAMA3__MODEL_NAME__CHAT_MODEL_PRE_LOAD_IN_NATIVE`	boolean	`false`
`quarkus.langchain4j.llama3."model-name".chat-model.temperature` Temperature in [0,inf] Environment variable: `QUARKUS_LANGCHAIN4J_LLAMA3__MODEL_NAME__CHAT_MODEL_TEMPERATURE`	double	`0.1`
`quarkus.langchain4j.llama3."model-name".chat-model.max-tokens` Number of steps to run for < 0 = limited by context length Environment variable: `QUARKUS_LANGCHAIN4J_LLAMA3__MODEL_NAME__CHAT_MODEL_MAX_TOKENS`	int	`512`
`quarkus.langchain4j.llama3."model-name".enable-integration` Whether to enable the integration. Set to `false` to disable all requests. Environment variable: `QUARKUS_LANGCHAIN4J_LLAMA3__MODEL_NAME__ENABLE_INTEGRATION`	boolean	`true`
`quarkus.langchain4j.llama3."model-name".log-requests` Whether Jlama should log requests Environment variable: `QUARKUS_LANGCHAIN4J_LLAMA3__MODEL_NAME__LOG_REQUESTS`	boolean	`false`
`quarkus.langchain4j.llama3."model-name".log-responses` Whether Jlama client should log responses Environment variable: `QUARKUS_LANGCHAIN4J_LLAMA3__MODEL_NAME__LOG_RESPONSES`	boolean	`false`

Configuration property

Type

Default

quarkus.langchain4j.llama3.include-models-in-artifact

Determines whether the necessary Jlama models are downloaded and included in the jar at build time. Currently, this option is only valid for fast-jar deployments.

Environment variable: QUARKUS_LANGCHAIN4J_LLAMA3_INCLUDE_MODELS_IN_ARTIFACT

boolean

true

quarkus.langchain4j.llama3.chat-model.enabled

Whether the model should be enabled

Environment variable: QUARKUS_LANGCHAIN4J_LLAMA3_CHAT_MODEL_ENABLED

boolean

true

quarkus.langchain4j.llama3.chat-model.model-name

Model name to use

Environment variable: QUARKUS_LANGCHAIN4J_LLAMA3_CHAT_MODEL_MODEL_NAME

string

mukel/Llama-3.2-1B-Instruct-GGUF

quarkus.langchain4j.llama3.chat-model.quantization

Quantization of the model to use

Environment variable: QUARKUS_LANGCHAIN4J_LLAMA3_CHAT_MODEL_QUANTIZATION

string

Q4_0

quarkus.langchain4j.llama3.chat-model.pre-load-in-native

Llama3.java supports AOT model preloading, enabling 0-overhead, instant inference, with minimal TTFT (time-to-first-token). A specialized, larger binary will be generated, with no parsing overhead for that particular model. It can still run other models, although incurring the usual parsing overhead.

Environment variable: QUARKUS_LANGCHAIN4J_LLAMA3_CHAT_MODEL_PRE_LOAD_IN_NATIVE

boolean

false

quarkus.langchain4j.llama3.models-path

Location on the file-system which serves as a cache for the models

Environment variable: QUARKUS_LANGCHAIN4J_LLAMA3_MODELS_PATH

path

${user.home}/.langchain4j/models

quarkus.langchain4j.llama3.chat-model.temperature

Temperature in [0,inf]

Environment variable: QUARKUS_LANGCHAIN4J_LLAMA3_CHAT_MODEL_TEMPERATURE

double