GPULlama3.java Chat Models
GPULlama3.java provides a Java-native implementation of LLMs that runs entirely in Java and executes automatically on GPUs via TornadoVM.
This extension allows Quarkus applications to use locally hosted LLMs (Llama3, Mistral, Qwen2.5, Deepseek-R1-Distill-Qwen, Qwen3, Phi3, IBM Granite 3.2+, IBM Granite 4.0) for chat-based inference, leveraging GPU acceleration without requiring native CUDA code.
A collection of demo Quarkus applications built with GPULlama3.java is available as a reference.
Prerequisites
Java Version and TornadoVM
GPULlama3.java requires TornadoVM and Java 21 or later due to its use of the Java Vector API.
Install TornadoVM with SDKMAN!:
sdk install tornadovm 2.2.0-opencl
Or, manually:
Linux (x86_64)
wget https://github.com/beehive-lab/TornadoVM/releases/download/v2.2.0/tornadovm-2.2.0-opencl-linux-amd64.zip
unzip tornadovm-2.2.0-opencl-linux-amd64.zip
export TORNADOVM_HOME="$(pwd)/tornadovm-2.2.0-opencl"
export PATH=$TORNADOVM_HOME/bin:$PATH
macOS (Apple Silicon)
wget https://github.com/beehive-lab/TornadoVM/releases/download/v2.2.0/tornadovm-2.2.0-opencl-mac-aarch64.zip
unzip tornadovm-2.2.0-opencl-mac-aarch64.zip
export TORNADOVM_HOME="$(pwd)/tornadovm-2.2.0-opencl"
export PATH=$TORNADOVM_HOME/bin:$PATH
To verify installation:
tornado --devices
tornado --version
The TornadoVM installation:

- Sets the TORNADOVM_HOME environment variable to the TornadoVM SDK path.
- TORNADOVM_HOME contains the tornado-argfile with all the JVM arguments required to enable TornadoVM.
- ⚠️ The tornado-argfile should be used for building and running the Quarkus application (see section Building & Running the Quarkus Application).
Using GPULlama3.java
To integrate the GPULlama3 chat model into your Quarkus application, add the following dependency:
<dependency>
<groupId>io.quarkiverse.langchain4j</groupId>
<artifactId>quarkus-langchain4j-gpu-llama3</artifactId>
<version>1.5.0.CR2</version>
</dependency>
Important: If no other LLM extension is configured, AI Services will automatically use the GPU-accelerated GPULlama3!
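For example, a plain AI Service interface is all that is needed; with no other model provider configured, quarkus-langchain4j backs it with GPULlama3. A minimal sketch (the Assistant interface and method are illustrative, not part of the extension):

import dev.langchain4j.service.UserMessage;
import io.quarkiverse.langchain4j.RegisterAiService;

// Hypothetical AI Service: with no other LLM extension present,
// quarkus-langchain4j wires it to the GPULlama3 chat model.
@RegisterAiService
public interface Assistant {
    String chat(@UserMessage String question);
}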
Sample implementation for ChatModel:
@Path("chat")
public class ChatLanguageModelResource {
private final ChatModel chatModel;
public ChatLanguageModelResource(ChatModel chatModel) {
this.chatModel = chatModel;
}
@GET
@Path("blocking")
public String blocking() {
return chatModel.chat("When was the nobel prize for economics first awarded?");
}
}
Send requests to the blocking endpoint:
curl http://localhost:8080/chat/blocking
Sample implementation for StreamingChatModel:
@Path("chat")
public class ChatLanguageModelResource {
private final StreamingChatModel streamingChatModel;
public ChatLanguageModelResource(StreamingChatModel streamingChatModel) {
this.streamingChatModel = streamingChatModel;
}
@GET
@Path("streaming")
@RestStreamElementType(MediaType.TEXT_PLAIN)
public Multi<String> streaming() {
return Multi.createFrom().emitter(emitter -> {
streamingChatModel.chat("When was the nobel prize for economics first awarded?",
new StreamingChatResponseHandler() {
@Override
public void onPartialResponse(String token) {
emitter.emit(token);
}
@Override
public void onError(Throwable error) {
emitter.fail(error);
}
@Override
public void onCompleteResponse(ChatResponse completeResponse) {
emitter.complete();
}
});
});
}
}
Send requests to the streaming endpoint:
curl http://localhost:8080/chat/streaming
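Alternatively, an AI Service method can return a Multi<String> directly, letting quarkus-langchain4j handle the streaming plumbing. A minimal sketch (the StreamingAssistant interface is illustrative):

import dev.langchain4j.service.UserMessage;
import io.quarkiverse.langchain4j.RegisterAiService;
import io.smallrye.mutiny.Multi;

// Hypothetical streaming AI Service: quarkus-langchain4j emits
// partial responses as items of the returned Multi.
@RegisterAiService
public interface StreamingAssistant {
    Multi<String> chat(@UserMessage String question);
}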
Configure GPULlama3
The GPULlama3 extension can be configured via standard Quarkus properties:
# Enable GPULlama3 integration
quarkus.langchain4j.gpu-llama3.enable-integration=true
# Select the default model
quarkus.langchain4j.gpu-llama3.chat-model.model-name=unsloth/Llama-3.2-1B-Instruct-GGUF
quarkus.langchain4j.gpu-llama3.chat-model.quantization=F16
quarkus.langchain4j.gpu-llama3.chat-model.temperature=0.7
quarkus.langchain4j.gpu-llama3.chat-model.max-tokens=1024
Model files are automatically downloaded from Beehive Lab's Hugging Face collections if they are not available locally.
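These properties set global defaults. Individual calls can also carry their own parameters through the LangChain4j ChatRequest API, as in the sketch below (based on the langchain4j 1.x request API; whether the GPULlama3 backend honors every parameter is model-dependent, and the values shown are illustrative):

import dev.langchain4j.data.message.UserMessage;
import dev.langchain4j.model.chat.ChatModel;
import dev.langchain4j.model.chat.request.ChatRequest;
import dev.langchain4j.model.chat.request.ChatRequestParameters;
import dev.langchain4j.model.chat.response.ChatResponse;

public class PerRequestParameters {

    // Overrides the configured defaults for a single call.
    static String ask(ChatModel chatModel, String question) {
        ChatRequest request = ChatRequest.builder()
                .messages(UserMessage.from(question))
                .parameters(ChatRequestParameters.builder()
                        .temperature(0.2)      // illustrative value
                        .maxOutputTokens(256)  // illustrative value
                        .build())
                .build();
        ChatResponse response = chatModel.chat(request);
        return response.aiMessage().text();
    }
}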
Building & Running the Quarkus Application
Dev Mode
To run your Quarkus application in dev mode with TornadoVM:
1. Ensure your pom.xml contains the quarkus-langchain4j-gpu-llama3 dependency (shown earlier).

2. Add the TornadoVM argfile as a Maven property:
<properties>
<tornado.argfile>/path/to/tornado-argfile</tornado.argfile>
</properties>
3. Pass the argfile to the JVM in the plugin configuration for dev mode:
<plugin>
<groupId>io.quarkus</groupId>
<artifactId>quarkus-maven-plugin</artifactId>
<configuration>
<jvmArgs>@${tornado.argfile}</jvmArgs>
</configuration>
</plugin>
4. Launch dev mode explicitly:
mvn quarkus:dev
Production Mode
To build and run your application in production mode:
1. Build the Quarkus application:
mvn clean package
2. Run the generated jar with the TornadoVM argfile:
java @$TORNADOVM_HOME/tornado-argfile -jar target/quarkus-app/quarkus-run.jar
⚠️ Important: Ensure TORNADOVM_HOME and the tornado-argfile path are correctly set.
Supported Models and Quantizations
The following models have been tested with GPULlama3.java and can be found in Beehive Lab's Hugging Face collections.
⚠️ Important:
Quantization format names are model-dependent. Some models require F16, others fp16 (lowercase), as defined in their GGUF files.
Use exactly the spelling listed below, or the model may fail to load.
| Model | Quantizations | Model Identifier (Hugging Face) |
|---|---|---|
| Llama 3.2 1B | | |
| Llama 3.2 3B | | |
| Llama 3.1 8B | | |
| Mistral 7B | | |
| Qwen2.5 0.5B | | |
| Qwen2.5 1.5B | | |
| DeepSeek-R1 Distill Qwen 1.5B | | |
| DeepSeek-R1 Distill Qwen 7B | | |
| Qwen3 0.6B | | |
| Qwen3 1.7B | | |
| Qwen3 4B | | |
| Qwen3 8B | | |
| Phi 3 Mini 4k | | |
| Phi 3 Mini 4k | | |
| Phi 3 Mini 128k | | |
| Phi 3.1 Mini 128k | | |
| Granite 3.2 2B | | |
| Granite 3.2 8B | | |
| Granite 3.3 2B | | |
| Granite 3.3 8B | | |
| Granite 4.0 1B | | |
Each entry corresponds to a GGUF model tested to run on TornadoVM via GPULlama3.java.
Configuration Reference
Configuration property fixed at build time - All other configuration properties are overridable at runtime

| Configuration property | Type | Default |
|---|---|---|
| Determines whether the necessary GPULlama3 models are downloaded and included in the jar at build time. Currently, this option is only valid for fast-jar deployments. | boolean | |
| Whether the model should be enabled | boolean | |
| Model name to use | string | |
| Quantization of the model to use | string | |
| Location on the file-system which serves as a cache for the models | path | |
| What sampling temperature to use, between 0.0 and 1.0. | double | |
| What sampling topP to use, between 0.0 and 1.0. | double | |
| What seed value to use. | int | |
| The maximum number of tokens to generate in the completion. | int | |
| Whether to enable the integration. Set to false to disable all requests. | boolean | |
| Whether GPULlama3 should log requests | boolean | |
| Whether the GPULlama3 client should log responses | boolean | |

The same chat-model properties are also available for additional named model configurations:

| Configuration property | Type | Default |
|---|---|---|
| Model name to use | string | |
| Quantization of the model to use | string | |
| What sampling temperature to use, between 0.0 and 1.0. | double | |
| What sampling topP to use, between 0.0 and 1.0. | double | |
| What seed value to use. | int | |
| The maximum number of tokens to generate in the completion. | int | |
| Whether to enable the integration. Set to false to disable all requests. | boolean | |
| Whether GPULlama3 should log requests | boolean | |
| Whether the GPULlama3 client should log responses | boolean | |
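Where named model configurations are used (second table above), the corresponding model can be injected with the @ModelName qualifier from quarkus-langchain4j. A minimal sketch, assuming a configuration named my-model (the name and class are illustrative):

import dev.langchain4j.model.chat.ChatModel;
import io.quarkiverse.langchain4j.ModelName;
import jakarta.enterprise.context.ApplicationScoped;
import jakarta.inject.Inject;

@ApplicationScoped
public class NamedModelClient {

    // Injects the chat model configured under the hypothetical
    // named configuration "my-model".
    @Inject
    @ModelName("my-model")
    ChatModel chatModel;

    public String ask(String question) {
        return chatModel.chat(question);
    }
}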