GPULlama3.java Chat Models

GPULlama3.java provides a Java-native implementation of Llama 3 that runs entirely in Java and is offloaded automatically to GPUs via TornadoVM.

This extension allows Quarkus applications to use locally hosted Llama 3 and other compatible models (e.g., Mistral, Qwen3, Phi-3) for chat-based inference, leveraging GPU acceleration without requiring native CUDA code.

Prerequisites

Java Version and TornadoVM

GPULlama3.java requires Java 21 or later due to its use of the Java Vector API and TornadoVM integration.

Install TornadoVM locally as follows:

cd ~
git clone git@github.com:beehive-lab/TornadoVM.git
cd ~/TornadoVM
./bin/tornadovm-installer --jdk jdk21 --backend opencl
source setvars.sh

The above steps:

  • Set the TORNADOVM_SDK environment variable to the TornadoVM SDK path.

  • Create a tornado-argfile under ~/TornadoVM containing the JVM arguments required to enable TornadoVM.

In Quarkus dev mode, the tornado-argfile is picked up automatically. In production mode, you must pass the argfile to the JVM yourself, as shown below.
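
A minimal sketch of launching a packaged application with the argfile, assuming the default fast-jar packaging and a TornadoVM checkout under ~/TornadoVM:

java @~/TornadoVM/tornado-argfile -jar target/quarkus-app/quarkus-run.jar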

Using GPULlama3.java

To integrate the GPULlama3 chat model into your Quarkus application, add the following dependency:

<dependency>
    <groupId>io.quarkiverse.langchain4j</groupId>
    <artifactId>quarkus-langchain4j-gpu-llama3</artifactId>
    <version>1.4.0</version>
</dependency>

Important: If no other LLM extension is configured, AI Services will automatically use the GPU-accelerated GPULlama3!
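
For example, a minimal AI Service is then backed by GPULlama3 with no provider-specific wiring (the Assistant interface below is a hypothetical name, not part of the extension):

import io.quarkiverse.langchain4j.RegisterAiService;

@RegisterAiService
public interface Assistant {

    // The single String parameter is sent as the user message
    String chat(String message);
}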

Sample implementation for ChatModel:

@Path("chat")
public class ChatLanguageModelResource {
    private final ChatModel chatModel;
    public ChatLanguageModelResource(ChatModel chatModel) {
        this.chatModel = chatModel;
    }

    @GET
    @Path("blocking")
    public String blocking() {
        return chatModel.chat("When was the nobel prize for economics first awarded?");
    }
}

Send requests to the blocking endpoint:

curl http://localhost:8080/chat/blocking

Sample implementation for StreamingChatModel:

@Path("chat")
public class ChatLanguageModelResource {
    private final StreamingChatModel streamingChatModel;
    public ChatLanguageModelResource(StreamingChatModel streamingChatModel) {
        this.streamingChatModel = streamingChatModel;
    }

    @GET
    @Path("streaming")
    @RestStreamElementType(MediaType.TEXT_PLAIN)
    public Multi<String> streaming() {
        return Multi.createFrom().emitter(emitter -> {
            streamingChatModel.chat("When was the nobel prize for economics first awarded?",
                    new StreamingChatResponseHandler() {
                        @Override
                        public void onPartialResponse(String token) {
                            emitter.emit(token);
                        }

                        @Override
                        public void onError(Throwable error) {
                            emitter.fail(error);
                        }

                        @Override
                        public void onCompleteResponse(ChatResponse completeResponse) {
                            emitter.complete();
                        }
                    });
        });
    }
}

Send requests to the streaming endpoint:

curl http://localhost:8080/chat/streaming

Configure GPULlama3

The GPULlama3 extension can be configured via standard Quarkus properties:

# Enable GPULlama3 integration
quarkus.langchain4j.gpu-llama3.enable-integration=true

# Select the default model
quarkus.langchain4j.gpu-llama3.chat-model.model-name=unsloth/Llama-3.2-1B-Instruct-GGUF
quarkus.langchain4j.gpu-llama3.chat-model.quantization=F16
quarkus.langchain4j.gpu-llama3.chat-model.temperature=0.7
quarkus.langchain4j.gpu-llama3.chat-model.max-tokens=1024

Model files are automatically downloaded from Hugging Face if they are not already available locally.
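
Downloaded models are cached on the local file system (by default under ${user.home}/.langchain4j/models, per the reference below). The cache location can be overridden; a sketch, assuming the property spelling that corresponds to the QUARKUS_LANGCHAIN4J_GPU_LLAMA3_MODELS_PATH environment variable (/opt/models is an arbitrary example path):

# Store downloaded GGUF files outside the default cache
quarkus.langchain4j.gpu-llama3.models-path=/opt/models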

Supported Models and Quantizations

The following models have been tested with GPULlama3.java and can be found in Beehive Lab’s HuggingFace Collections.

⚠️ Important: Quantization format names are model-dependent. Some models require F16, others fp16 (lowercase), as defined in their GGUF files. Use exactly the spelling listed below, or the model may fail to load.
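
For example, the Llama 3.1 8B entry below lists lowercase fp16, so the configuration must match that spelling exactly:

# This model's GGUF files declare "fp16", not "F16"
quarkus.langchain4j.gpu-llama3.chat-model.model-name=brittlewis12/Meta-Llama-3.1-8B-Instruct-GGUF
quarkus.langchain4j.gpu-llama3.chat-model.quantization=fp16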

Model                         | Quantizations | Model Identifier (Hugging Face)
------------------------------|---------------|---------------------------------------------
Llama 3.2 1B                  | F16, Q8_0     | unsloth/Llama-3.2-1B-Instruct-GGUF
Llama 3.2 3B                  | F16, Q8_0     | unsloth/Llama-3.2-3B-Instruct-GGUF
Llama 3.1 8B                  | fp16, Q8_0    | brittlewis12/Meta-Llama-3.1-8B-Instruct-GGUF
Mistral 7B                    | fp16, Q8_0    | MaziyarPanahi/Mistral-7B-Instruct-v0.3-GGUF
Qwen2.5 0.5B                  | F16, Q8_0     | bartowski/Qwen2.5-0.5B-Instruct-GGUF
Qwen2.5 1.5B                  | fp16, Q8_0    | Qwen/Qwen2.5-1.5B-Instruct-GGUF
DeepSeek-R1 Distill Qwen 1.5B | F16, Q8_0     | hdnh2006/DeepSeek-R1-Distill-Qwen-1.5B-GGUF
DeepSeek-R1 Distill Qwen 7B   | F16, Q8_0     | XelotX/DeepSeek-R1-Distill-Qwen-7B-GGUF
Qwen3 0.6B                    | F16, Q8_0     | ggml-org/Qwen3-0.6B-GGUF
Qwen3 1.7B                    | F16, Q8_0     | ggml-org/Qwen3-1.7B-GGUF
Qwen3 4B                      | F16, Q8_0     | ggml-org/Qwen3-4B-GGUF
Qwen3 8B                      | F16, Q8_0     | ggml-org/Qwen3-8B-GGUF
Phi 3 Mini 4k                 | fp16          | microsoft/Phi-3-mini-4k-instruct-gguf
Phi 3 Mini 4k                 | Q8_0          | bartowski/Phi-3-mini-4k-instruct-GGUF
Phi 3 Mini 128k               | Q8_0          | QuantFactory/Phi-3-mini-4k-instruct-GGUF
Phi 3.1 Mini 128k             | Q8_0          | bartowski/Phi-3.1-mini-128k-instruct-GGUF

Each entry corresponds to a GGUF model tested to run on TornadoVM via GPULlama3.java.

Configuration Reference

The property marked as fixed at build time cannot be overridden at runtime; all other configuration properties are overridable at runtime.

QUARKUS_LANGCHAIN4J_GPU_LLAMA3_INCLUDE_MODELS_IN_ARTIFACT (boolean, default: true, fixed at build time)
    Determines whether the necessary GPULlama3 models are downloaded and included in the jar at build time. Currently, this option is only valid for fast-jar deployments.

QUARKUS_LANGCHAIN4J_GPU_LLAMA3_CHAT_MODEL_ENABLED (boolean, default: true)
    Whether the model should be enabled.

QUARKUS_LANGCHAIN4J_GPU_LLAMA3_CHAT_MODEL_MODEL_NAME (string, default: unsloth/Llama-3.2-1B-Instruct-GGUF)
    Model name to use.

QUARKUS_LANGCHAIN4J_GPU_LLAMA3_CHAT_MODEL_QUANTIZATION (string, default: F16)
    Quantization of the model to use.

QUARKUS_LANGCHAIN4J_GPU_LLAMA3_MODELS_PATH (path, default: ${user.home}/.langchain4j/models)
    Location on the file system that serves as a cache for the models.

QUARKUS_LANGCHAIN4J_GPU_LLAMA3_CHAT_MODEL_TEMPERATURE (double, default: 0.3)
    The sampling temperature to use, between 0.0 and 1.0.

QUARKUS_LANGCHAIN4J_GPU_LLAMA3_CHAT_MODEL_TOP_P (double, default: 0.85)
    The sampling top-p to use, between 0.0 and 1.0.

QUARKUS_LANGCHAIN4J_GPU_LLAMA3_CHAT_MODEL_SEED (int, default: 1234)
    The seed value to use.

QUARKUS_LANGCHAIN4J_GPU_LLAMA3_CHAT_MODEL_MAX_TOKENS (int, default: 512)
    The maximum number of tokens to generate in the completion.

QUARKUS_LANGCHAIN4J_GPU_LLAMA3_ENABLE_INTEGRATION (boolean, default: true)
    Whether to enable the integration. Set to false to disable all requests.

QUARKUS_LANGCHAIN4J_GPU_LLAMA3_LOG_REQUESTS (boolean, default: false)
    Whether GPULlama3 should log requests.

QUARKUS_LANGCHAIN4J_GPU_LLAMA3_LOG_RESPONSES (boolean, default: false)
    Whether the GPULlama3 client should log responses.

Named model config

The same properties are also available per named model configuration; __MODEL_NAME__ in each environment variable stands for the name of the model configuration.

QUARKUS_LANGCHAIN4J_GPU_LLAMA3__MODEL_NAME__CHAT_MODEL_MODEL_NAME (string, default: unsloth/Llama-3.2-1B-Instruct-GGUF)
    Model name to use.

QUARKUS_LANGCHAIN4J_GPU_LLAMA3__MODEL_NAME__CHAT_MODEL_QUANTIZATION (string, default: F16)
    Quantization of the model to use.

QUARKUS_LANGCHAIN4J_GPU_LLAMA3__MODEL_NAME__CHAT_MODEL_TEMPERATURE (double, default: 0.3)
    The sampling temperature to use, between 0.0 and 1.0.

QUARKUS_LANGCHAIN4J_GPU_LLAMA3__MODEL_NAME__CHAT_MODEL_TOP_P (double, default: 0.85)
    The sampling top-p to use, between 0.0 and 1.0.

QUARKUS_LANGCHAIN4J_GPU_LLAMA3__MODEL_NAME__CHAT_MODEL_SEED (int, default: 1234)
    The seed value to use.

QUARKUS_LANGCHAIN4J_GPU_LLAMA3__MODEL_NAME__CHAT_MODEL_MAX_TOKENS (int, default: 512)
    The maximum number of tokens to generate in the completion.

QUARKUS_LANGCHAIN4J_GPU_LLAMA3__MODEL_NAME__ENABLE_INTEGRATION (boolean, default: true)
    Whether to enable the integration. Set to false to disable all requests.

QUARKUS_LANGCHAIN4J_GPU_LLAMA3__MODEL_NAME__LOG_REQUESTS (boolean, default: false)
    Whether GPULlama3 should log requests.

QUARKUS_LANGCHAIN4J_GPU_LLAMA3__MODEL_NAME__LOG_RESPONSES (boolean, default: false)
    Whether the GPULlama3 client should log responses.
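
For illustration, a named-configuration sketch (the name m1 is hypothetical; the property spelling mirrors the environment variables above):

# Hypothetical named model configuration "m1"
quarkus.langchain4j.gpu-llama3.m1.chat-model.model-name=ggml-org/Qwen3-0.6B-GGUF
quarkus.langchain4j.gpu-llama3.m1.chat-model.quantization=F16
quarkus.langchain4j.gpu-llama3.m1.chat-model.temperature=0.2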

Limitations

  • TornadoVM currently does not support GraalVM Native Image builds.

  • Ensure that the TornadoVM environment (TORNADOVM_SDK and tornado-argfile) is properly set before running Quarkus.

  • Only Java 21 or newer versions are supported.