GPULlama3.java Chat Models
GPULlama3.java provides a Java-native implementation of Llama3 that runs entirely in Java and executes automatically on GPUs via TornadoVM.
This extension allows Quarkus applications to use locally hosted Llama3 and other compatible models (e.g., Mistral, Qwen3, Phi3) for chat-based inference, leveraging GPU acceleration without requiring native CUDA code.
Prerequisites
Java Version and TornadoVM
GPULlama3.java requires Java 21 or later due to its use of the Java Vector API and TornadoVM integration.
Install TornadoVM locally as follows:
cd ~
git clone git@github.com:beehive-lab/TornadoVM.git
cd ~/TornadoVM
./bin/tornadovm-installer --jdk jdk21 --backend opencl
source setvars.sh
The above steps:
- Set the TORNADOVM_SDK environment variable to the TornadoVM SDK path.
- Create a tornado-argfile under ~/TornadoVM containing the JVM arguments required to enable TornadoVM.
- The tornado-argfile is automatically used in Quarkus dev mode.
- For production mode, you must manually pass the argfile to the JVM (see the example below).
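For example, a packaged application can be started with the JVM's @argfile syntax. The command below is only a sketch: it assumes the argfile lives at the default location created by the steps above and that the application was packaged with the default Quarkus fast-jar layout.
java @"$HOME/TornadoVM/tornado-argfile" -jar target/quarkus-app/quarkus-run.jar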
Using GPULlama3.java
To integrate the GPULlama3 chat model into your Quarkus application, add the following dependency:
<dependency>
<groupId>io.quarkiverse.langchain4j</groupId>
<artifactId>quarkus-langchain4j-gpu-llama3</artifactId>
<version>1.4.0</version>
</dependency>
Important: If no other LLM extension is configured, AI Services will automatically use the GPU-accelerated GPULlama3!
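For example, a declarative AI Service can be used instead of calling the model directly. The interface below is an illustrative sketch; the Assistant name and method are not part of this extension:
import dev.langchain4j.service.UserMessage;
import io.quarkiverse.langchain4j.RegisterAiService;

@RegisterAiService
public interface Assistant {

    // The question is sent to the chat model configured for the application,
    // which is GPULlama3 when no other LLM extension is present
    String chat(@UserMessage String question);
}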
Sample implementation for ChatModel:
@Path("chat")
public class ChatLanguageModelResource {
private final ChatModel chatModel;
public ChatLanguageModelResource(ChatModel chatModel) {
this.chatModel = chatModel;
}
@GET
@Path("blocking")
public String blocking() {
return chatModel.chat("When was the nobel prize for economics first awarded?");
}
}
Send a request to the blocking endpoint:
curl http://localhost:8080/chat/blocking
Sample implementation for StreamingChatModel:
@Path("chat")
public class ChatLanguageModelResource {
private final StreamingChatModel streamingChatModel;
public ChatLanguageModelResource(StreamingChatModel streamingChatModel) {
this.streamingChatModel = streamingChatModel;
}
@GET
@Path("streaming")
@RestStreamElementType(MediaType.TEXT_PLAIN)
public Multi<String> streaming() {
return Multi.createFrom().emitter(emitter -> {
streamingChatModel.chat("When was the nobel prize for economics first awarded?",
new StreamingChatResponseHandler() {
@Override
public void onPartialResponse(String token) {
emitter.emit(token);
}
@Override
public void onError(Throwable error) {
emitter.fail(error);
}
@Override
public void onCompleteResponse(ChatResponse completeResponse) {
emitter.complete();
}
});
});
}
}
Send a request to the streaming endpoint:
curl http://localhost:8080/chat/streaming
Configure GPULlama3
The GPULlama3 extension can be configured via standard Quarkus properties:
# Enable GPULlama3 integration
quarkus.langchain4j.gpu-llama3.enable-integration=true
# Select the default model
quarkus.langchain4j.gpu-llama3.chat-model.model-name=unsloth/Llama-3.2-1B-Instruct-GGUF
quarkus.langchain4j.gpu-llama3.chat-model.quantization=F16
quarkus.langchain4j.gpu-llama3.chat-model.temperature=0.7
quarkus.langchain4j.gpu-llama3.chat-model.max-tokens=1024
Model files are automatically downloaded from the Beehive Lab HuggingFace repositories if they are not available locally.
Supported Models and Quantizations
The following models have been tested with GPULlama3.java and can be found in Beehive Lab’s HuggingFace Collections.
⚠️ Important:
Quantization format names are model-dependent. Some models require F16, others fp16 (lowercase), as defined in their GGUF files.
Use exactly the spelling defined in the model's GGUF file, or the model may fail to load.
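As an illustration (the exact spelling for each model comes from its GGUF metadata, so treat these values as placeholders):
# A model whose GGUF metadata declares uppercase F16
quarkus.langchain4j.gpu-llama3.chat-model.quantization=F16
# A model whose GGUF metadata declares lowercase fp16 would instead need
# quarkus.langchain4j.gpu-llama3.chat-model.quantization=fp16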
| Model | Quantizations | Model Identifier (Hugging Face) |
|---|---|---|
| Llama 3.2 1B | | |
| Llama 3.2 3B | | |
| Llama 3.1 8B | | |
| Mistral 7B | | |
| Qwen2.5 0.5B | | |
| Qwen2.5 1.5B | | |
| DeepSeek-R1 Distill Qwen 1.5B | | |
| DeepSeek-R1 Distill Qwen 7B | | |
| Qwen3 0.6B | | |
| Qwen3 1.7B | | |
| Qwen3 4B | | |
| Qwen3 8B | | |
| Phi 3 Mini 4k | | |
| Phi 3 Mini 4k | | |
| Phi 3 Mini 128k | | |
| Phi 3.1 Mini 128k | | |
Each entry corresponds to a GGUF model tested to run on TornadoVM via GPULlama3.java.
Configuration Reference
Configuration property fixed at build time - All other configuration properties are overridable at runtime
| Configuration property | Type | Default |
|---|---|---|
| Determines whether the necessary GPULlama3 models are downloaded and included in the jar at build time. Currently, this option is only valid for | boolean | |
| Whether the model should be enabled | boolean | |
| Model name to use | string | |
| Quantization of the model to use | string | |
| Location on the file-system which serves as a cache for the models | path | |
| What sampling temperature to use, between 0.0 and 1.0. | double | |
| What sampling topP to use, between 0.0 and 1.0. | double | |
| What seed value to use. | int | |
| The maximum number of tokens to generate in the completion. | int | |
| Whether to enable the integration. Set to | boolean | |
| Whether GPULlama3 should log requests | boolean | |
| Whether GPULlama3 client should log responses | boolean | |

| Configuration property | Type | Default |
|---|---|---|
| Model name to use | string | |
| Quantization of the model to use | string | |
| What sampling temperature to use, between 0.0 and 1.0. | double | |
| What sampling topP to use, between 0.0 and 1.0. | double | |
| What seed value to use. | int | |
| The maximum number of tokens to generate in the completion. | int | |
| Whether to enable the integration. Set to | boolean | |
| Whether GPULlama3 should log requests | boolean | |
| Whether GPULlama3 client should log responses | boolean | |