GPULlama3.java Chat Models
GPULlama3.java provides a Java-native implementation of Llama3 that runs entirely in Java and executes automatically on GPUs via TornadoVM.
This extension allows Quarkus applications to use locally hosted Llama3 and other compatible models (e.g., Mistral, Qwen3, Phi3) for chat-based inference, leveraging GPU acceleration without requiring native CUDA code.
Prerequisites
Java Version and TornadoVM
GPULlama3.java requires Java 21 or later due to its use of the Java Vector API and TornadoVM integration.
Install TornadoVM locally as follows:
cd ~
git clone git@github.com:beehive-lab/TornadoVM.git
cd ~/TornadoVM
./bin/tornadovm-installer --jdk jdk21 --backend opencl
source setvars.sh
The above steps:
- Set the TORNADOVM_SDK environment variable to the TornadoVM SDK path.
- Create a tornado-argfile under ~/TornadoVM containing the JVM arguments required to enable TornadoVM.
- The tornado-argfile is automatically used in Quarkus dev mode.
- For production mode, you must manually pass the argfile to the JVM (see the example below).
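For example, a packaged application can be started with the JVM's @argfile syntax. The command below is only a sketch: it assumes the argfile lives at the default location created by the steps above and that the application was packaged with the default Quarkus fast-jar layout.
java @"$HOME/TornadoVM/tornado-argfile" -jar target/quarkus-app/quarkus-run.jar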
Using GPULlama3.java
To integrate the GPULlama3 chat model into your Quarkus application, add the following dependency:
<dependency>
<groupId>io.quarkiverse.langchain4j</groupId>
<artifactId>quarkus-langchain4j-gpu-llama3</artifactId>
<version>1.4.0</version>
</dependency>
Important: If no other LLM extension is configured, AI Services will automatically use the GPU-accelerated GPULlama3!
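For example, a declarative AI Service can be used instead of calling the model directly. The interface below is an illustrative sketch; the Assistant name and method are not part of this extension:
import dev.langchain4j.service.UserMessage;
import io.quarkiverse.langchain4j.RegisterAiService;

@RegisterAiService
public interface Assistant {

    // The question is sent to the chat model configured for the application,
    // which is GPULlama3 when no other LLM extension is present
    String chat(@UserMessage String question);
}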
Sample implementation for ChatModel:
@Path("chat")
public class ChatLanguageModelResource {
private final ChatModel chatModel;
public ChatLanguageModelResource(ChatModel chatModel) {
this.chatModel = chatModel;
}
@GET
@Path("blocking")
public String blocking() {
return chatModel.chat("When was the nobel prize for economics first awarded?");
}
}
Send a request to the blocking endpoint:
curl http://localhost:8080/chat/blocking
Sample implementation for StreamingChatModel:
@Path("chat")
public class ChatLanguageModelResource {
private final StreamingChatModel streamingChatModel;
public ChatLanguageModelResource(StreamingChatModel streamingChatModel) {
this.streamingChatModel = streamingChatModel;
}
@GET
@Path("streaming")
@RestStreamElementType(MediaType.TEXT_PLAIN)
public Multi<String> streaming() {
return Multi.createFrom().emitter(emitter -> {
streamingChatModel.chat("When was the nobel prize for economics first awarded?",
new StreamingChatResponseHandler() {
@Override
public void onPartialResponse(String token) {
emitter.emit(token);
}
@Override
public void onError(Throwable error) {
emitter.fail(error);
}
@Override
public void onCompleteResponse(ChatResponse completeResponse) {
emitter.complete();
}
});
});
}
}
Send a request to the streaming endpoint:
curl http://localhost:8080/chat/streaming
Configure GPULlama3
The GPULlama3 extension can be configured via standard Quarkus properties:
# Enable GPULlama3 integration
quarkus.langchain4j.gpu-llama3.enable-integration=true
# Select the default model
quarkus.langchain4j.gpu-llama3.chat-model.model-name=unsloth/Llama-3.2-1B-Instruct-GGUF
quarkus.langchain4j.gpu-llama3.chat-model.quantization=F16
quarkus.langchain4j.gpu-llama3.chat-model.temperature=0.7
quarkus.langchain4j.gpu-llama3.chat-model.max-tokens=1024
Model files are automatically downloaded from the Beehive Lab HuggingFace repositories if they are not available locally.
Supported Models and Quantizations
The following models have been tested with GPULlama3.java and can be found in Beehive Lab’s HuggingFace Collections.
⚠️ Important:
Quantization format names are model-dependent. Some models require F16, others fp16 (lowercase), as defined in their GGUF files.
Use exactly the spelling defined in the model's GGUF file, or the model may fail to load.
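As an illustration (the exact spelling for each model comes from its GGUF metadata, so treat these values as placeholders):
# A model whose GGUF metadata declares uppercase F16
quarkus.langchain4j.gpu-llama3.chat-model.quantization=F16
# A model whose GGUF metadata declares lowercase fp16 would instead need
# quarkus.langchain4j.gpu-llama3.chat-model.quantization=fp16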
| Model | Quantizations | Model Identifier (Hugging Face) |
|---|---|---|
| Llama 3.2 1B | | |
| Llama 3.2 3B | | |
| Llama 3.1 8B | | |
| Mistral 7B | | |
| Qwen2.5 0.5B | | |
| Qwen2.5 1.5B | | |
| DeepSeek-R1 Distill Qwen 1.5B | | |
| DeepSeek-R1 Distill Qwen 7B | | |
| Qwen3 0.6B | | |
| Qwen3 1.7B | | |
| Qwen3 4B | | |
| Qwen3 8B | | |
| Phi 3 Mini 4k | | |
| Phi 3 Mini 4k | | |
| Phi 3 Mini 128k | | |
| Phi 3.1 Mini 128k | | |
Each entry corresponds to a GGUF model tested to run on TornadoVM via GPULlama3.java.
Configuration Reference
Configuration property fixed at build time - All other configuration properties are overridable at runtime
| Configuration property | Type | Default |
|---|---|---|
| Determines whether the necessary GPULlama3 models are downloaded and included in the jar at build time. Currently, this option is only valid for | boolean | |
| Whether the model should be enabled | boolean | |
| Model name to use | string | |
| Quantization of the model to use | string | |
| Location on the file-system which serves as a cache for the models | path | |
| What sampling temperature to use, between 0.0 and 1.0. | double | |
| What sampling topP to use, between 0.0 and 1.0. | double | |
| What seed value to use. | int | |
| The maximum number of tokens to generate in the completion. | int | |
| Whether to enable the integration. Set to | boolean | |
| Whether GPULlama3 should log requests | boolean | |
| Whether GPULlama3 client should log responses | boolean | |

| Configuration property | Type | Default |
|---|---|---|
| Model name to use | string | |
| Quantization of the model to use | string | |
| What sampling temperature to use, between 0.0 and 1.0. | double | |
| What sampling topP to use, between 0.0 and 1.0. | double | |
| What seed value to use. | int | |
| The maximum number of tokens to generate in the completion. | int | |
| Whether to enable the integration. Set to | boolean | |
| Whether GPULlama3 should log requests | boolean | |
| Whether GPULlama3 client should log responses | boolean | |