Hugging Face
Hugging Face is a platform for natural language processing (NLP) that hosts a large collection of pre-trained language models and gives easy access to state-of-the-art models for a wide range of NLP tasks.
Using Hugging Face models
To use Hugging Face LLMs, add the following dependency to your project:
<dependency>
<groupId>io.quarkiverse.langchain4j</groupId>
<artifactId>quarkus-langchain4j-hugging-face</artifactId>
<version>0.24.0.CR1</version>
</dependency>
If no other LLM extension is installed, AI Services will automatically use the configured Hugging Face model.
Hugging Face provides multiple kinds of models. Only text-to-text models are supported, that is, models that take text as input and return text as output.
By default, the extension uses:
- tiiuae/falcon-7b-instruct as the chat model (inference endpoint: https://api-inference.huggingface.co/models/tiiuae/falcon-7b-instruct)
- sentence-transformers/all-MiniLM-L6-v2 as the embedding model (inference endpoint: https://api-inference.huggingface.co/pipeline/feature-extraction/sentence-transformers/all-MiniLM-L6-v2)
Configuration
Hugging Face models require an API key, which you can obtain by creating an account on the Hugging Face platform.
The API key can be set in the application.properties file:
quarkus.langchain4j.huggingface.api-key=hf-...
Alternatively, set the QUARKUS_LANGCHAIN4J_HUGGINGFACE_API_KEY environment variable.
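The environment variable name above matches the usual MicroProfile Config convention: every character of the property name that is not a letter, digit, or underscore becomes `_`, and the result is upper-cased. A minimal sketch of that mapping follows; `EnvVarNames` is a hypothetical helper written for illustration, not a class shipped by Quarkus or the extension.

```java
import java.util.Locale;

// Hypothetical helper (not part of Quarkus): derives the environment
// variable name that corresponds to a configuration property name.
final class EnvVarNames {

    private EnvVarNames() {
    }

    static String toEnvVar(String propertyName) {
        // Replace anything that is not a letter, digit, or underscore with '_',
        // then upper-case the result in a locale-independent way.
        return propertyName
                .replaceAll("[^A-Za-z0-9_]", "_")
                .toUpperCase(Locale.ROOT);
    }
}
```

For example, `EnvVarNames.toEnvVar("quarkus.langchain4j.huggingface.api-key")` yields the variable name used above.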
Several configuration properties are available:
Configuration property fixed at build time - All other configuration properties are overridable at runtime

Description | Type | Default |
---|---|---|
Whether the model should be enabled | boolean | |
Whether the model should be enabled | boolean | |
Whether the model should be enabled | boolean | |
HuggingFace API key | string | |
Timeout for HuggingFace calls | | |
The URL of the inference endpoint for the chat model. When using Hugging Face with the inference API, the URL is https://api-inference.huggingface.co/models/<model-id>. When using a deployed inference endpoint, the URL is the URL of the endpoint. When using a local Hugging Face model, the URL is the URL of the local model. | | |
Float (0.0-100.0). The temperature of the sampling operation. 1 means regular sampling, 0 means always take the highest score, and 100.0 gets closer to uniform probability. | double | |
Int (0-250). The number of new tokens to generate. This does not include the input length; it is an estimate of the size of the generated text you want. Each new token slows down the request, so look for a balance between response time and length of generated text. | int | |
If set to | boolean | |
If the model is not ready, wait for it instead of receiving a 503. This limits the number of requests required to get your inference done. It is advised to set this flag to true only after receiving a 503 error, as it will limit hanging in your application to known places. | boolean | |
Whether or not to use sampling; use greedy decoding otherwise. | boolean | |
The number of highest-probability vocabulary tokens to keep for top-k filtering. | int | |
If set to less than | double | |
The parameter for repetition penalty. 1.0 means no penalty. See this paper for more details. | double | |
Whether chat model requests should be logged | boolean | |
Whether chat model responses should be logged | boolean | |
The URL of the inference endpoint for the embedding model. When using Hugging Face with the inference API, the URL is https://api-inference.huggingface.co/pipeline/feature-extraction/<model-id>. When using a deployed inference endpoint, the URL is the URL of the endpoint. When using a local Hugging Face model, the URL is the URL of the local model. | | |
If the model is not ready, wait for it instead of receiving a 503. This limits the number of requests required to get your inference done. It is advised to set this flag to true only after receiving a 503 error, as it will limit hanging in your application to known places. | boolean | |
Whether the HuggingFace client should log requests | boolean | |
Whether the HuggingFace client should log responses | boolean | |
Whether or not to enable the integration. Defaults to | boolean | |
About the Duration format
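To write duration values, use the standard java.time.Duration format. You can also use a simplified format, starting with a number:
- If the value is only a number, it represents time in seconds.
- If the value is a number followed by ms, it represents time in milliseconds.
In other cases, the simplified format is translated to the java.time.Duration format for parsing:
- If the value is a number followed by h, m, or s, it is prefixed with PT.
- If the value is a number followed by d, it is prefixed with P.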
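As an illustration of how a simplified duration value can be turned into a `java.time.Duration`, here is a minimal sketch. It assumes the rules Quarkus documents for its configuration: a bare number is seconds, a number followed by `ms` is milliseconds, a number followed by `h`, `m`, or `s` is prefixed with `PT`, and a number followed by `d` is prefixed with `P`. `SimplifiedDurations` is a hypothetical helper, not the converter Quarkus actually uses.

```java
import java.time.Duration;

// Hypothetical helper (not the Quarkus converter): translates the simplified
// duration format into a java.time.Duration under the rules stated above.
final class SimplifiedDurations {

    private SimplifiedDurations() {
    }

    static Duration parse(String value) {
        String v = value.trim();
        if (v.matches("\\d+")) {
            // Only a number: interpreted as seconds.
            return Duration.ofSeconds(Long.parseLong(v));
        }
        if (v.matches("\\d+ms")) {
            // Number followed by "ms": interpreted as milliseconds.
            return Duration.ofMillis(Long.parseLong(v.substring(0, v.length() - 2)));
        }
        if (v.matches("\\d+[hmsHMS]")) {
            // Number followed by h, m, or s: prefix with "PT" (time part).
            return Duration.parse("PT" + v);
        }
        if (v.matches("\\d+[dD]")) {
            // Number followed by d: prefix with "P" (date part).
            return Duration.parse("P" + v);
        }
        // Anything else is expected to already be in the standard format.
        return Duration.parse(v);
    }
}
```

With this sketch, `10`, `10s`, and `PT10S` all describe the same ten-second timeout.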
Configuring the chat model
You can change the chat model by setting the quarkus.langchain4j.huggingface.chat-model.inference-endpoint-url property.
When using a model hosted on Hugging Face, the property should be set to: https://api-inference.huggingface.co/models/<model-id>.
For example, to use the google/flan-t5-small model, set:
quarkus.langchain4j.huggingface.chat-model.inference-endpoint-url=https://api-inference.huggingface.co/models/google/flan-t5-small
Remember that only text-to-text models are supported.
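The hosted endpoint URL is simply the fixed base https://api-inference.huggingface.co/models/ followed by the model id. The following sketch builds that URL and the matching configuration line; `InferenceEndpoints` and its method names are hypothetical and exist only to make the pattern explicit.

```java
import java.net.URI;

// Hypothetical helper: builds the hosted inference endpoint URL for a model id
// such as "google/flan-t5-small", following the pattern described above.
final class InferenceEndpoints {

    static final String HOSTED_BASE = "https://api-inference.huggingface.co/models/";
    static final String CHAT_URL_PROPERTY =
            "quarkus.langchain4j.huggingface.chat-model.inference-endpoint-url";

    private InferenceEndpoints() {
    }

    static URI hostedModel(String modelId) {
        if (modelId == null || modelId.isBlank() || modelId.startsWith("/")) {
            throw new IllegalArgumentException("Expected a model id such as owner/name");
        }
        return URI.create(HOSTED_BASE + modelId);
    }

    // Renders the line to put in application.properties.
    static String asPropertyLine(String modelId) {
        return CHAT_URL_PROPERTY + "=" + hostedModel(modelId);
    }
}
```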
Using inference endpoints and local models
Hugging Face models can be deployed to provide inference endpoints.
In this case, configure the quarkus.langchain4j.huggingface.chat-model.inference-endpoint-url property to point to the endpoint URL:
quarkus.langchain4j.huggingface.chat-model.inference-endpoint-url=https://j9dkyuliy170f3ia.us-east-1.aws.endpoints.huggingface.cloud
If you run a model locally, adapt the URL accordingly:
quarkus.langchain4j.huggingface.chat-model.inference-endpoint-url=http://localhost:8085
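The extension sends requests to the configured endpoint for you, but when checking that a deployed or local endpoint is reachable it can help to know roughly what such a request looks like. The sketch below only builds a request with the JDK HTTP client and does not send it. It assumes the JSON shape of the Hugging Face Inference API text-generation task (`inputs`, `parameters.temperature`, `parameters.max_new_tokens`, `options.wait_for_model`); a local server may expect something different. `TextGenerationRequests` is a hypothetical class, not part of the extension.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;
import java.time.Duration;

// Hypothetical sketch: builds (but does not send) a text-generation request.
// The JSON field names are an assumption based on the Hugging Face Inference API.
final class TextGenerationRequests {

    private TextGenerationRequests() {
    }

    static String jsonBody(String prompt, double temperature, int maxNewTokens, boolean waitForModel) {
        return "{\"inputs\":\"" + escape(prompt) + "\","
                + "\"parameters\":{\"temperature\":" + temperature
                + ",\"max_new_tokens\":" + maxNewTokens + "},"
                + "\"options\":{\"wait_for_model\":" + waitForModel + "}}";
    }

    static HttpRequest build(URI endpoint, String apiKey, String body, Duration timeout) {
        return HttpRequest.newBuilder(endpoint)
                .timeout(timeout)
                .header("Authorization", "Bearer " + apiKey) // API key sent as a bearer token
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body, StandardCharsets.UTF_8))
                .build();
    }

    // Minimal JSON string escaping for the prompt text.
    private static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            switch (c) {
                case '"': sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                case '\t': sb.append("\\t"); break;
                default:
                    if (c < 0x20) {
                        sb.append(String.format("\\u%04x", (int) c));
                    } else {
                        sb.append(c);
                    }
            }
        }
        return sb.toString();
    }
}
```

Passing the resulting request to `java.net.http.HttpClient#send` would perform the call; that step is left out here so the sketch has no network dependency.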
Document Retriever and Embedding
When using Hugging Face models, the recommended practice is to use the EmbeddingModel provided by Hugging Face.
If no other LLM extension is installed, retrieve the embedding model as follows:
@Inject EmbeddingModel model; // Injects the embedding model
You can configure the model using:
quarkus.langchain4j.huggingface.embedding-model.inference-endpoint-url=https://api-inference.huggingface.co/pipeline/feature-extraction/sentence-transformers/all-MiniLM-L6-v2
Not all Sentence Transformers models are supported by the embedding model. If you want to use a custom Sentence Transformers model, you need to create your own embedding model.
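An embedding model turns text into a vector of floats, and document retrieval then typically ranks stored text segments by how close their vectors are to the vector of the query. The sketch below shows that comparison with cosine similarity, which is an assumed metric here (the embedding store in use decides the actual one). Plain `float[]` arrays stand in for embeddings, and `CosineSimilarity` is a hypothetical helper.

```java
// Hypothetical helper: compares two embedding vectors with cosine similarity.
// Values near 1.0 mean the vectors point in nearly the same direction.
final class CosineSimilarity {

    private CosineSimilarity() {
    }

    static double between(float[] a, float[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("Vectors must have the same dimension");
        }
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += (double) a[i] * b[i];
            normA += (double) a[i] * a[i];
            normB += (double) b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            throw new IllegalArgumentException("Cosine similarity is undefined for a zero vector");
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}
```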