Jlama Chat Models

Jlama provides a way to run Large Language Models (LLMs) locally and in pure Java, embedded within your Quarkus application. It supports a growing set of models available on Hugging Face: https://huggingface.co/tjake.

Prerequisites

Java Version and Vector API

Jlama requires Java 21 or later because it leverages the Java Vector API for efficient inference. Since this API is still in preview/incubation, it must be enabled explicitly at runtime with the following JVM flags:

--enable-preview --enable-native-access=ALL-UNNAMED --add-modules jdk.incubator.vector
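
For example, when running the packaged application (assuming the default fast-jar layout under target/quarkus-app), the flags can be passed directly to the java command:

java --enable-preview --enable-native-access=ALL-UNNAMED --add-modules jdk.incubator.vector -jar target/quarkus-app/quarkus-run.jar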

Dev Mode Support

When using Dev Mode:

  • The extension will automatically pull the configured model.

  • JVM flags are set up automatically to enable the C2 compiler, which is required for proper inference performance.

  • Disk space is required for downloaded models. The model directory can be customized via:

quarkus.langchain4j.jlama.models-path=/path/to/model/storage

Jlama models can be large (several GB) and may take time to download and initialize.

Using Jlama

To integrate Jlama into your Quarkus project, add the following dependency:

<dependency>
    <groupId>io.quarkiverse.langchain4j</groupId>
    <artifactId>quarkus-langchain4j-jlama</artifactId>
    <version>1.0.2</version>
</dependency>
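
If your build uses Gradle instead of Maven, the equivalent declaration (same coordinates and version assumed) would be:

implementation("io.quarkiverse.langchain4j:quarkus-langchain4j-jlama:1.0.2")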

If no other LLM extension is installed, AI Services will automatically use the configured Jlama chat model.
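
As a minimal sketch of what such an AI Service can look like (the Assistant interface and its prompt are illustrative, not part of the extension):

import dev.langchain4j.service.SystemMessage;
import dev.langchain4j.service.UserMessage;
import io.quarkiverse.langchain4j.RegisterAiService;

@RegisterAiService
public interface Assistant {

    // Each call is turned into a chat request against the configured Jlama model.
    @SystemMessage("You are a concise assistant.")
    String chat(@UserMessage String question);
}

Injecting this interface (for example with @Inject in a REST resource) and calling chat(...) sends the prompt to the locally running Jlama model.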

Chat Model Configuration

By default, Jlama uses the TinyLlama-1.1B-Chat-v1.0-Jlama-Q4 model:

quarkus.langchain4j.jlama.chat-model.model-name=tjake/TinyLlama-1.1B-Chat-v1.0-Jlama-Q4

To switch to another model, such as Granite, update the model name:

quarkus.langchain4j.jlama.chat-model.model-name=tjake/granite-3.0-2b-instruct-JQ4
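
Multiple model configurations can also be defined as named configurations; the name granite below is only an example. The corresponding properties are listed under "Named model config" in the reference that follows:

quarkus.langchain4j.jlama.granite.chat-model.model-name=tjake/granite-3.0-2b-instruct-JQ4

An AI Service can then be pointed at this configuration via the modelName attribute, for example @RegisterAiService(modelName = "granite").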

Configuration Reference

Configuration properties fixed at build time cannot be overridden at runtime; all other configuration properties are overridable at runtime.

quarkus.langchain4j.jlama.include-models-in-artifact
    Determines whether the necessary Jlama models are downloaded and included in the jar at build time. Currently, this option is only valid for fast-jar deployments.
    Environment variable: QUARKUS_LANGCHAIN4J_JLAMA_INCLUDE_MODELS_IN_ARTIFACT
    Type: boolean. Default: true

quarkus.langchain4j.jlama.chat-model.enabled
    Whether the chat model should be enabled.
    Environment variable: QUARKUS_LANGCHAIN4J_JLAMA_CHAT_MODEL_ENABLED
    Type: boolean. Default: true

quarkus.langchain4j.jlama.embedding-model.enabled
    Whether the embedding model should be enabled.
    Environment variable: QUARKUS_LANGCHAIN4J_JLAMA_EMBEDDING_MODEL_ENABLED
    Type: boolean. Default: true

quarkus.langchain4j.jlama.chat-model.model-name
    Model name to use for the chat model.
    Environment variable: QUARKUS_LANGCHAIN4J_JLAMA_CHAT_MODEL_MODEL_NAME
    Type: string. Default: tjake/TinyLlama-1.1B-Chat-v1.0-Jlama-Q4

quarkus.langchain4j.jlama.embedding-model.model-name
    Model name to use for the embedding model.
    Environment variable: QUARKUS_LANGCHAIN4J_JLAMA_EMBEDDING_MODEL_MODEL_NAME
    Type: string. Default: intfloat/e5-small-v2

quarkus.langchain4j.jlama.models-path
    Location on the file system that serves as a cache for the models.
    Environment variable: QUARKUS_LANGCHAIN4J_JLAMA_MODELS_PATH
    Type: path. Default: ${user.home}/.langchain4j/models

quarkus.langchain4j.jlama.chat-model.temperature
    The sampling temperature to use, between 0.0 and 1.0. Higher values like 0.8 make the output more random, while lower values like 0.2 make it more focused and deterministic. It is generally recommended to set this or the top-k property, but not both.
    Environment variable: QUARKUS_LANGCHAIN4J_JLAMA_CHAT_MODEL_TEMPERATURE
    Type: double. Default: 0.3

quarkus.langchain4j.jlama.chat-model.max-tokens
    The maximum number of tokens to generate in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
    Environment variable: QUARKUS_LANGCHAIN4J_JLAMA_CHAT_MODEL_MAX_TOKENS
    Type: int. No default.

quarkus.langchain4j.jlama.enable-integration
    Whether to enable the integration. Set to false to disable all requests.
    Environment variable: QUARKUS_LANGCHAIN4J_JLAMA_ENABLE_INTEGRATION
    Type: boolean. Default: true

quarkus.langchain4j.jlama.log-requests
    Whether Jlama should log requests.
    Environment variable: QUARKUS_LANGCHAIN4J_JLAMA_LOG_REQUESTS
    Type: boolean. Default: false

quarkus.langchain4j.jlama.log-responses
    Whether the Jlama client should log responses.
    Environment variable: QUARKUS_LANGCHAIN4J_JLAMA_LOG_RESPONSES
    Type: boolean. Default: false

Named model config

The same model properties are available per named configuration, where model-name stands for the name of the configuration:

quarkus.langchain4j.jlama."model-name".chat-model.model-name
    Model name to use for the chat model.
    Environment variable: QUARKUS_LANGCHAIN4J_JLAMA__MODEL_NAME__CHAT_MODEL_MODEL_NAME
    Type: string. Default: tjake/TinyLlama-1.1B-Chat-v1.0-Jlama-Q4

quarkus.langchain4j.jlama."model-name".embedding-model.model-name
    Model name to use for the embedding model.
    Environment variable: QUARKUS_LANGCHAIN4J_JLAMA__MODEL_NAME__EMBEDDING_MODEL_MODEL_NAME
    Type: string. Default: intfloat/e5-small-v2

quarkus.langchain4j.jlama."model-name".chat-model.temperature
    The sampling temperature to use, between 0.0 and 1.0. Higher values like 0.8 make the output more random, while lower values like 0.2 make it more focused and deterministic. It is generally recommended to set this or the top-k property, but not both.
    Environment variable: QUARKUS_LANGCHAIN4J_JLAMA__MODEL_NAME__CHAT_MODEL_TEMPERATURE
    Type: double. Default: 0.3

quarkus.langchain4j.jlama."model-name".chat-model.max-tokens
    The maximum number of tokens to generate in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length.
    Environment variable: QUARKUS_LANGCHAIN4J_JLAMA__MODEL_NAME__CHAT_MODEL_MAX_TOKENS
    Type: int. No default.

quarkus.langchain4j.jlama."model-name".enable-integration
    Whether to enable the integration. Set to false to disable all requests.
    Environment variable: QUARKUS_LANGCHAIN4J_JLAMA__MODEL_NAME__ENABLE_INTEGRATION
    Type: boolean. Default: true

quarkus.langchain4j.jlama."model-name".log-requests
    Whether Jlama should log requests.
    Environment variable: QUARKUS_LANGCHAIN4J_JLAMA__MODEL_NAME__LOG_REQUESTS
    Type: boolean. Default: false

quarkus.langchain4j.jlama."model-name".log-responses
    Whether the Jlama client should log responses.
    Environment variable: QUARKUS_LANGCHAIN4J_JLAMA__MODEL_NAME__LOG_RESPONSES
    Type: boolean. Default: false