Jlama Embedding Models

Jlama provides local embedding models suitable for RAG (Retrieval-Augmented Generation), semantic search, and document classification, all without leaving the Java process.

Prerequisites

Jlama embedding models require Java 21 or later. Start the JVM with preview features enabled, native access allowed, and the incubating Vector API module added:

--enable-preview --enable-native-access=ALL-UNNAMED --add-modules jdk.incubator.vector

See Jlama Chat Models for Dev Mode details and model setup.
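
A quick way to confirm that the JVM flags above took effect is to check them at startup. The following is a minimal sketch using only the JDK; the class name `JlamaPrerequisites` is hypothetical and not part of Jlama or Quarkus.

```java
// Hypothetical helper: checks the prerequisites listed above using only JDK APIs.
final class JlamaPrerequisites {

    /** True when the running JVM's feature version is at least the required one. */
    static boolean javaVersionAtLeast(int required) {
        return Runtime.version().feature() >= required;
    }

    /** True when jdk.incubator.vector was resolved into the boot layer (--add-modules). */
    static boolean vectorModulePresent() {
        return ModuleLayer.boot().findModule("jdk.incubator.vector").isPresent();
    }

    public static void main(String[] args) {
        System.out.println("Java 21 or later: " + javaVersionAtLeast(21));
        System.out.println("Vector API module present: " + vectorModulePresent());
    }
}
```

If the second check reports false, the `--add-modules jdk.incubator.vector` flag most likely did not reach the JVM that runs the application.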

Using Jlama Embeddings

To enable embedding model support, add the following dependency to your pom.xml:

<dependency>
    <groupId>io.quarkiverse.langchain4j</groupId>
    <artifactId>quarkus-langchain4j-jlama</artifactId>
    <version>1.1.0</version>
</dependency>

Default Model

By default, the embedding model is intfloat/e5-small-v2.

To use a different embedding model, set the following property to its name (the default value is shown here):

quarkus.langchain4j.jlama.embedding-model.model-name=intfloat/e5-small-v2

The following configuration enables request and response logging and sets both a chat model and an embedding model:

quarkus.langchain4j.log-requests=true
quarkus.langchain4j.log-responses=true

quarkus.langchain4j.jlama.chat-model.model-name=tjake/granite-3.0-2b-instruct-JQ4
quarkus.langchain4j.jlama.embedding-model.model-name=intfloat/e5-small-v2

Programmatic Access

To inject the embedding model programmatically:

@Inject EmbeddingModel model;

This allows direct access for use in retrievers, RAG pipelines, or semantic search.
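
To illustrate the semantic search use case, here is a minimal sketch that ranks documents by cosine similarity to a query vector. It works on plain `float[]` vectors so that it runs without any dependencies. In an application the vectors would come from the injected model, for example via `model.embed(text).content().vector()` in LangChain4j. The class and method names (`CosineRanker`, `cosine`, `rank`) are hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper: ranks document vectors by cosine similarity to a query vector.
final class CosineRanker {

    /** Cosine similarity of two vectors with the same dimension; 0.0 if either is all zeros. */
    static double cosine(float[] a, float[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("dimension mismatch");
        }
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += (double) a[i] * b[i];
            normA += (double) a[i] * a[i];
            normB += (double) b[i] * b[i];
        }
        if (normA == 0 || normB == 0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    /** Returns document indices ordered from most to least similar to the query. */
    static List<Integer> rank(float[] query, List<float[]> docs) {
        List<Integer> order = new ArrayList<>();
        for (int i = 0; i < docs.size(); i++) {
            order.add(i);
        }
        // Negate the score so the highest similarity sorts first.
        order.sort(Comparator.comparingDouble((Integer i) -> -cosine(query, docs.get(i))));
        return order;
    }
}
```

For a real RAG pipeline an embedding store would usually replace this linear scan; the sketch only shows what the similarity comparison does.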

Configuration Reference

Configuration property fixed at build time - All other configuration properties are overridable at runtime

Configuration property	Type	Default
`quarkus.langchain4j.jlama.include-models-in-artifact`: Determines whether the necessary Jlama models are downloaded and included in the jar at build time. Currently, this option is only valid for `fast-jar` deployments. Environment variable: `QUARKUS_LANGCHAIN4J_JLAMA_INCLUDE_MODELS_IN_ARTIFACT`	boolean	`true`
`quarkus.langchain4j.jlama.chat-model.enabled`: Whether the chat model should be enabled. Environment variable: `QUARKUS_LANGCHAIN4J_JLAMA_CHAT_MODEL_ENABLED`	boolean	`true`
`quarkus.langchain4j.jlama.embedding-model.enabled`: Whether the embedding model should be enabled. Environment variable: `QUARKUS_LANGCHAIN4J_JLAMA_EMBEDDING_MODEL_ENABLED`	boolean	`true`
`quarkus.langchain4j.jlama.chat-model.model-name`: Name of the chat model to use. Environment variable: `QUARKUS_LANGCHAIN4J_JLAMA_CHAT_MODEL_MODEL_NAME`	string	`tjake/TinyLlama-1.1B-Chat-v1.0-Jlama-Q4`
`quarkus.langchain4j.jlama.embedding-model.model-name`: Name of the embedding model to use. Environment variable: `QUARKUS_LANGCHAIN4J_JLAMA_EMBEDDING_MODEL_MODEL_NAME`	string	`intfloat/e5-small-v2`
`quarkus.langchain4j.jlama.models-path`: Location on the file system that serves as a cache for the models. Environment variable: `QUARKUS_LANGCHAIN4J_JLAMA_MODELS_PATH`	path	`${user.home}/.langchain4j/models`
`quarkus.langchain4j.jlama.chat-model.temperature`: The sampling temperature to use, between 0.0 and 1.0. Higher values like 0.8 make the output more random, while lower values like 0.2 make it more focused and deterministic. It is generally recommended to set this or the `top-k` property, but not both. Environment variable: `QUARKUS_LANGCHAIN4J_JLAMA_CHAT_MODEL_TEMPERATURE`	double	`0.3f`
`quarkus.langchain4j.jlama.chat-model.max-tokens`: The maximum number of tokens to generate in the completion. The token count of your prompt plus `max_tokens` cannot exceed the model’s context length. Environment variable: `QUARKUS_LANGCHAIN4J_JLAMA_CHAT_MODEL_MAX_TOKENS`	int
`quarkus.langchain4j.jlama.enable-integration`: Whether to enable the integration. Set to `false` to disable all requests. Environment variable: `QUARKUS_LANGCHAIN4J_JLAMA_ENABLE_INTEGRATION`	boolean	`true`
`quarkus.langchain4j.jlama.log-requests`: Whether the Jlama client should log requests. Environment variable: `QUARKUS_LANGCHAIN4J_JLAMA_LOG_REQUESTS`	boolean	`false`
`quarkus.langchain4j.jlama.log-responses`: Whether the Jlama client should log responses. Environment variable: `QUARKUS_LANGCHAIN4J_JLAMA_LOG_RESPONSES`	boolean	`false`
Named model config	Type	Default
`quarkus.langchain4j.jlama."model-name".chat-model.model-name`: Name of the chat model to use. Environment variable: `QUARKUS_LANGCHAIN4J_JLAMA__MODEL_NAME__CHAT_MODEL_MODEL_NAME`	string	`tjake/TinyLlama-1.1B-Chat-v1.0-Jlama-Q4`
`quarkus.langchain4j.jlama."model-name".embedding-model.model-name`: Name of the embedding model to use. Environment variable: `QUARKUS_LANGCHAIN4J_JLAMA__MODEL_NAME__EMBEDDING_MODEL_MODEL_NAME`	string	`intfloat/e5-small-v2`
`quarkus.langchain4j.jlama."model-name".chat-model.temperature`: The sampling temperature to use, between 0.0 and 1.0. Higher values like 0.8 make the output more random, while lower values like 0.2 make it more focused and deterministic. It is generally recommended to set this or the `top-k` property, but not both. Environment variable: `QUARKUS_LANGCHAIN4J_JLAMA__MODEL_NAME__CHAT_MODEL_TEMPERATURE`	double	`0.3f`
`quarkus.langchain4j.jlama."model-name".chat-model.max-tokens`: The maximum number of tokens to generate in the completion. The token count of your prompt plus `max_tokens` cannot exceed the model’s context length. Environment variable: `QUARKUS_LANGCHAIN4J_JLAMA__MODEL_NAME__CHAT_MODEL_MAX_TOKENS`	int
`quarkus.langchain4j.jlama."model-name".enable-integration`: Whether to enable the integration. Set to `false` to disable all requests. Environment variable: `QUARKUS_LANGCHAIN4J_JLAMA__MODEL_NAME__ENABLE_INTEGRATION`	boolean	`true`
`quarkus.langchain4j.jlama."model-name".log-requests`: Whether the Jlama client should log requests. Environment variable: `QUARKUS_LANGCHAIN4J_JLAMA__MODEL_NAME__LOG_REQUESTS`	boolean	`false`
`quarkus.langchain4j.jlama."model-name".log-responses`: Whether the Jlama client should log responses. Environment variable: `QUARKUS_LANGCHAIN4J_JLAMA__MODEL_NAME__LOG_RESPONSES`	boolean	`false`
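
The environment variable names in the table appear to follow the usual MicroProfile Config convention: every character that is not a letter, digit, or underscore becomes an underscore, and the result is upper-cased. The sketch below reproduces that mapping so you can derive the variable for your own named model. The class name `EnvVarNames` is hypothetical, and the rule is an assumption inferred from the table rather than something this page states.

```java
// Hypothetical helper: derives an environment variable name from a property name,
// assuming the MicroProfile Config style mapping seen in the table above.
final class EnvVarNames {

    static String toEnvVar(String property) {
        StringBuilder sb = new StringBuilder(property.length());
        for (int i = 0; i < property.length(); i++) {
            char c = property.charAt(i);
            boolean keep = (c >= 'a' && c <= 'z')
                    || (c >= 'A' && c <= 'Z')
                    || (c >= '0' && c <= '9')
                    || c == '_';
            // Dots, dashes, and quotes all collapse to a single underscore each.
            sb.append(keep ? Character.toUpperCase(c) : '_');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toEnvVar("quarkus.langchain4j.jlama.embedding-model.model-name"));
        System.out.println(toEnvVar("quarkus.langchain4j.jlama.\"model-name\".chat-model.temperature"));
    }
}
```

Note how the quotes around a named model each turn into an underscore, which is why named-model variables contain double underscores.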
