Using Streamed Responses with Quarkus LangChain4j

Streamed responses allow large language models to return partial answers as they are generated, which significantly improves perceived latency and responsiveness for end users. With Quarkus LangChain4j, you can integrate streaming via REST (Server-Sent Events) or WebSockets, using Multi<String> for reactive, non-blocking processing.

This guide shows how to:

  • Define AI services that return streamed responses

  • Implement both SSE and WebSocket endpoints

  • Test your application using curl and wscat

Why Use Streamed Responses?

Traditional AI services generate the entire response before returning it, which can lead to:

  • Perceived latency (long pause before the first word appears)

  • Higher memory usage (especially for long completions)

Streaming addresses this by sending tokens as they are produced. Benefits include:

  • Better user experience (progressive rendering)

  • Reduced memory pressure on both server and client

  • Easier integration with frontend frameworks (chat bots, dashboards)

Project Setup

Add the following dependencies in your pom.xml:

<!-- Or any other model provider that supports streaming, such as OpenAI -->
<dependency>
    <groupId>io.quarkiverse.langchain4j</groupId>
    <artifactId>quarkus-langchain4j-ollama</artifactId>
    <version>1.1.0</version>
</dependency>
<dependency>
    <groupId>io.quarkus</groupId>
    <artifactId>quarkus-rest-jackson</artifactId>
</dependency>
<dependency>
    <groupId>io.quarkus</groupId>
    <artifactId>quarkus-websockets-next</artifactId>
</dependency>

If you are using Ollama, configure your model in application.properties:

quarkus.langchain4j.ollama.chat-model.model-name=qwen3:1.7b
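
Only the model name is required here. If your Ollama instance does not run on the default port, or large models need more time to respond, the base URL and timeout are configurable as well; the values below are illustrative defaults, not requirements:

# Illustrative values; adjust for your environment
quarkus.langchain4j.ollama.base-url=http://localhost:11434
quarkus.langchain4j.ollama.timeout=60s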

Streamed Responses in AI Services

To enable streaming, your AI service method must return a Multi<String>. Each emitted item represents a token or part of the final response.

import dev.langchain4j.service.SystemMessage;
import dev.langchain4j.service.UserMessage;
import io.quarkiverse.langchain4j.RegisterAiService;
import io.smallrye.mutiny.Multi;

@RegisterAiService
@SystemMessage("You are a helpful AI assistant. Be concise and to the point.")
public interface StreamedAssistant {

    @UserMessage("Answer the question: {question}")
    Multi<String> respondToQuestion(String question);

}

Quarkus uses Mutiny under the hood, and methods returning Multi are treated as non-blocking: do not call blocking code inside a streaming pipeline. For details, refer to the Quarkus Reactive Architecture documentation.
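
If you need to observe or transform tokens, stay on Mutiny's non-blocking operators. A minimal sketch (the logging is illustrative, and the method is just a hypothetical helper around the service above):

import io.quarkus.logging.Log;
import io.smallrye.mutiny.Multi;

// Decorate the stream with non-blocking side effects.
Multi<String> respondWithLogging(StreamedAssistant assistant, String question) {
    return assistant.respondToQuestion(question)
            .onItem().invoke(token -> Log.debugf("token: %s", token))    // runs per emitted chunk
            .onCompletion().invoke(() -> Log.debug("stream completed")); // end-of-stream hook
}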

Streaming with Server-Sent Events (SSE)

SSE is a simple way to stream text over HTTP. Let’s expose an endpoint returning Multi<String> as an event stream:

@Path("/stream")
public class Endpoint {

    @Inject StreamedAssistant assistant;

    @POST
    @Produces(MediaType.SERVER_SENT_EVENTS)
    public Multi<String> stream(String question) {
        return assistant.respondToQuestion(question);
    }
}

Run the application (mvn quarkus:dev), then use curl:

curl -N -X POST http://localhost:8080/stream -d "Why is the sky blue?" \
  -H "Content-Type: text/plain"

The -N option disables buffering so you see the stream as it arrives. You’ll receive a stream of tokens, each appearing as a new line in the terminal.
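
Plain-text events are often enough, but if your frontend expects JSON, Quarkus REST can serialize each event via the quarkus-rest-jackson dependency added earlier. A sketch using @RestStreamElementType to set the per-event media type; the Token record and the /json sub-path are illustrative additions to the Endpoint class above:

import org.jboss.resteasy.reactive.RestStreamElementType;

public record Token(String value) { }

@POST
@Path("/json")
@Produces(MediaType.SERVER_SENT_EVENTS)
@RestStreamElementType(MediaType.APPLICATION_JSON)
public Multi<Token> streamJson(String question) {
    return assistant.respondToQuestion(question)
            .onItem().transform(Token::new); // wrap each chunk so Jackson serializes it
}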

Streaming with WebSockets

For more interactive use cases (chat UIs, dashboards), you can expose a WebSocket endpoint using Quarkus WebSockets Next.

import io.quarkus.websockets.next.OnTextMessage;
import io.quarkus.websockets.next.WebSocket;
import io.smallrye.mutiny.Multi;
import jakarta.inject.Inject;

@WebSocket(path = "/ws/stream")
public class WebSocketEndpoint {

    @Inject
    WSStreamedAssistant assistant;

    @OnTextMessage
    public Multi<String> onTextMessage(String question) {
        return assistant.respondToQuestion(question);
    }
}
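
A stream can fail mid-response (model unavailable, timeout). Rather than closing the stream abruptly, one option is to recover inside the pipeline with Mutiny; the fallback message format below is illustrative:

@OnTextMessage
public Multi<String> onTextMessage(String question) {
    return assistant.respondToQuestion(question)
            .onFailure().recoverWithItem(t -> "[error: " + t.getMessage() + "]");
}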

To keep conversational state across messages on the same connection (a local chat history), annotate the AI service with @SessionScoped:

import dev.langchain4j.service.SystemMessage;
import dev.langchain4j.service.UserMessage;
import io.quarkiverse.langchain4j.RegisterAiService;
import io.smallrye.mutiny.Multi;
import jakarta.enterprise.context.SessionScoped;

@RegisterAiService
@SystemMessage("You are a helpful AI assistant. Be concise and to the point.")
@SessionScoped
public interface WSStreamedAssistant {

    @UserMessage("Answer the question: {question}")
    Multi<String> respondToQuestion(String question);
}

Install a WebSocket client like wscat:

npm install -g wscat

Connect and send a message:

wscat -c ws://localhost:8080/ws/stream
> Why is swimming pool water blue?

You’ll see the token stream printed as separate lines in real time.
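
Because the service is @SessionScoped, messages sent over the same connection share a chat history. An illustrative follow-up exchange (the prompts are examples):

wscat -c ws://localhost:8080/ws/stream
> My favorite color is teal.
> What is my favorite color?

A second connection starts with a fresh history: the session, and therefore the memory, is scoped to a single WebSocket connection.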

Summary

  • Use Multi<String> in your AI services to enable streaming

  • Streaming improves user experience and scalability

  • SSE offers a simple HTTP-based solution

  • WebSockets provide a more interactive and stateful option