Testing AI-Infused Applications
The quarkus-langchain4j-testing-evaluation-junit5 extension provides a pragmatic and extensible testing framework for evaluating AI-infused and agentic applications.
It integrates with JUnit 5 and @QuarkusTest, and offers tools for automating evaluation processes, scoring outputs, and generating evaluation reports using customizable evaluation strategies.
You can use this testing framework:
- using the API directly - to programmatically define and run evaluations in your tests or even in your applications
- using the JUnit 5 extension - to declaratively define evaluation tests with annotations
Features
The evaluation framework provides:
Core Capabilities:
- Concurrent Evaluation - Execute evaluations in parallel with configurable concurrency
- Fluent Builder API - Readable, chainable evaluation definitions (samples, evaluation strategy, reporting)
- Rich Evaluation Results - Scores, explanations, and metadata for detailed analysis
- Multiple Evaluation Strategies - Semantic similarity, AI judge, and custom strategies
- Flexible Sample Loading - YAML, JSON, or custom formats via SPI or CDI
Testing Approaches:
- Programmatic Testing - Traditional programmatic test definitions with Scorer
- Declarative Testing - @EvaluationTest for annotation-driven test configuration
- Test Templates - @StrategyTest to run tests against multiple strategies
- Multiple Sample Sources - @SampleSources to combine samples from different files
Assertions and Reporting:
- Fluent Assertions - EvaluationAssertions for readable test verification
- Readable Test Names - EvaluationDisplayNameGenerator shows scores and pass/fail status
- Multi-Format Reports - Generate Markdown, JSON, or custom format reports
Extensibility:
- Custom Strategies - Implement EvaluationStrategy for domain-specific evaluation logic
- Custom Sample Loaders - Support any sample format via the SampleLoader interface
- Custom Report Formatters - Generate reports in any format via the ReportFormatter interface
Maven Dependency
To use the EvaluationExtension, include the following Maven dependency in your pom.xml:
<dependency>
<groupId>io.quarkiverse.langchain4j</groupId>
<artifactId>quarkus-langchain4j-testing-evaluation-junit5</artifactId>
<scope>test</scope>
</dependency>
Using the extension
To use the extension, annotate your test class with @ExtendWith(EvaluationExtension.class) or @Evaluate:
import org.junit.jupiter.api.Test;
import org.junit.jupiter.api.extension.ExtendWith;
import io.quarkiverse.langchain4j.evaluation.junit5.EvaluationExtension;
@ExtendWith(EvaluationExtension.class)
public class MyScorerTests {
// Test cases go here
}
Or, you can use the @Evaluate annotation:
import org.junit.jupiter.api.Test;
import io.quarkiverse.langchain4j.evaluation.junit5.Evaluate;
@Evaluate
public class MyScorerTests {
// Test cases go here
}
This JUnit 5 extension can be combined with @QuarkusTest to test Quarkus applications:
import io.quarkus.test.junit.QuarkusTest;
import org.junit.jupiter.api.Test;
import io.quarkiverse.langchain4j.evaluation.junit5.Evaluate;
@QuarkusTest
@Evaluate
public class MyScorerTests {
// Test cases go here
}
Concepts
Evaluation
An Evaluation is the process of testing whether an AI system’s output meets expectations for a given input. Unlike traditional unit tests that check for exact matches, evaluations assess AI outputs using flexible criteria such as semantic similarity, correctness, or custom domain-specific logic.
In the context of Quarkus Langchain4J, an evaluation consists of:
- A Sample - The test case containing input parameters and expected output (and optional metadata such as tags)
- A Function Under Test - The AI service, model, or function being evaluated; it receives the input parameters and produces the actual output
- An Evaluation Strategy - The logic that determines whether the actual output is acceptable based on the expected output
- An Evaluation Result - The outcome containing pass/fail status, optional score, explanation, and metadata
The evaluation process runs the function under test with each sample's parameters, then applies the strategy to compare the actual output against the expected output and record a result.
For example, testing a customer support chatbot:
- Sample: Input = "What are your business hours?", Expected = "We’re open Monday-Friday, 9am-5pm"
- Function: The chatbot’s response generation (e.g., chatbot.chat("What are your business hours?"))
- Strategy: Semantic similarity (≥80% match), AI judge, or custom logic
- Result: Passed (score=0.92, explanation="Response conveys the same business hours information")
This approach recognizes that AI outputs are often non-deterministic and may vary in phrasing while remaining correct. It also provides flexibility to define evaluation criteria that align with application requirements.
Scorer
The Scorer (io.quarkiverse.langchain4j.testing.evaluation.Scorer) is the cornerstone component that orchestrates the evaluation of a set of samples (represented by io.quarkiverse.langchain4j.testing.evaluation.Samples) against a function (an entry point of the application) using one or more evaluation strategies.
It can run evaluations concurrently and produces an EvaluationReport summarizing the results and providing an aggregate score.
The produced score is the percentage of passed evaluations (between 0.0 and 100.0). It is calculated as the ratio of the number of passed evaluations to the total number of evaluations. For example, if 7 out of 10 evaluations pass, the score is 70.0.
In general, tests using the Scorer follow this pattern:
// The AI Service to test, or whatever entry point of your application
@Inject CustomerSupportAssistant assistant;
@Test
void testAiService(
// The scorer instance, with concurrency set to 5
@ScorerConfiguration(concurrency = 5) Scorer scorer,
// The samples loaded from a YAML file, or custom loader
@SampleLocation("src/test/resources/samples.yaml") Samples<String> samples) {
// Define the function that will be evaluated
// The parameters come from the sample
// The output of this function will be compared to the expected output in the samples
Function<Parameters, String> function = parameters -> assistant.chat(parameters.get(0));
EvaluationReport report = scorer.evaluate(samples, function,
new SemanticSimilarityStrategy(0.8)); // The evaluation strategy
assertThat(report.score()).isGreaterThanOrEqualTo(70); // Assert the score
}
Fluent Builder API
The framework provides a fluent builder API through the Evaluation class for more readable evaluation definitions:
import static io.quarkiverse.langchain4j.testing.evaluation.EvaluationAssertions.assertThat;
@Test
void evaluateWithFluentBuilder() {
EvaluationReport<String> report = Evaluation.<String>builder()
.withSamples("src/test/resources/customer-support-samples.yaml")
.evaluate(params -> bot.chat(params.get(0)))
.using(new SemanticSimilarityStrategy(0.80))
.run();
assertThat(report)
.hasScoreGreaterThan(75.0)
.hasAtLeastPassedEvaluations(4);
}
When using this API, the scorer is created automatically.
The fluent API also supports filtering by tags:
@Test
void evaluateFilteredByTags() {
// Evaluate only samples with specific tags
EvaluationReport<String> report = Evaluation.<String>builder()
.withSamples("src/test/resources/customer-support-samples.yaml")
.withTags("critical") // Only evaluate critical samples
.evaluate(params -> bot.chat(params.get(0)))
.using(new SemanticSimilarityStrategy(0.75))
.run();
assertThat(report)
.hasScoreGreaterThan(70.0)
.hasAtLeastPassedEvaluations(2);
}
You can apply multiple strategies to the same samples:
@Test
void evaluateWithMultipleStrategies() {
// Apply multiple strategies to the same samples
EvaluationReport<String> report = Evaluation.<String>builder()
.withSamples("src/test/resources/smoke-tests.yaml")
.withConcurrency(1)
.evaluate(params -> bot.chat(params.get(0)))
.using(new SemanticSimilarityStrategy(0.85))
.using(new SemanticSimilarityStrategy(0.75))
.run();
assertThat(report).hasScoreGreaterThan(60.0);
}
The fluent API also supports deferred execution, allowing you to build evaluation definitions and execute them later:
@Test
void deferredExecution() {
// Build the evaluation definition
EvaluationRunner<String> runner = Evaluation.<String>builder()
.withSamples("src/test/resources/smoke-tests.yaml")
.evaluate(params -> bot.chat(params.get(0)))
.using(new SemanticSimilarityStrategy(0.80))
.build();
// Execute later (can be called multiple times)
EvaluationReport<String> report1 = runner.run();
EvaluationReport<String> report2 = runner.run();
assertThat(report1).hasScoreGreaterThan(60.0);
assertThat(report2).hasScoreGreaterThan(60.0);
}
When multiple strategies are added, each sample is evaluated against all strategies, and results are aggregated in the report.
Samples
A Sample (io.quarkiverse.langchain4j.testing.evaluation.EvaluationSample) represents a single input-output test case.
It includes:
- a name: the name of the sample,
- the parameters: the parameter data for the test (input),
- the expected output: the expected result that will be evaluated,
- the tags: metadata that can categorize the sample for targeted evaluation (tags are optional).
When tags are set, the score can be calculated per tag (in addition to the global score).
A list of samples is represented by Samples (io.quarkiverse.langchain4j.testing.evaluation.Samples).
Samples can be defined using a builder pattern:
var s1 = EvaluationSample.<String>builder()
.withName("sample1")
.withParameter("value1")
.withExpectedOutput("my expected result2")
.build();
var s2 = EvaluationSample.<String>builder()
.withName("sample2")
.withParameter("value2")
.withExpectedOutput("my expected results")
.build();
Samples<String> samples = new Samples<>(List.of(s1, s2));
Alternatively, samples can be loaded from a YAML file using the @SampleLocation annotation:
- name: Sample1
  parameters:
    - "parameter1"
  expected-output: "expected1"
  tags: ["tag1"]
- name: Sample2
  parameters:
    - "parameter2"
  expected-output: "expected2"
  tags: ["tag1"]
Note: You can implement your own loader to support custom sample formats (see the Custom Sample Loaders section).
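For example, the tags declared in the YAML above can be used to scope an evaluation run and to assert per-tag results. The following sketch combines APIs shown later on this page (filterByTags and hasPassedForTag); bot stands for any injected AI service under test and the file path is illustrative.
@Test
void evaluateTaggedSamplesOnly(
        @SampleLocation("src/test/resources/samples.yaml") Samples<String> samples,
        Scorer scorer) {
    // Keep only the samples carrying the "tag1" tag (declared in the YAML above)
    Samples<String> tagged = samples.filterByTags("tag1");
    EvaluationReport<String> report = scorer.evaluate(
            tagged,
            params -> bot.chat(params.get(0)), // 'bot' is a hypothetical injected AI service
            new SemanticSimilarityStrategy(0.8));
    // Per-tag results are tracked in the report
    assertThat(report).hasPassedForTag("tag1");
}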
Evaluation Strategy
An EvaluationStrategy (io.quarkiverse.langchain4j.testing.evaluation.EvaluationStrategy) defines how to evaluate a sample.
The framework includes ready-to-use strategies (detailed below), and you can implement custom ones.
/**
* A strategy to evaluate the output of a model.
* @param <T> the type of the output.
*/
public interface EvaluationStrategy<T> {
/**
* Evaluate the output of a model.
* @param sample the sample to evaluate.
* @param output the output of the model.
* @return an EvaluationResult with pass/fail status, optional score, explanation, and metadata.
*/
EvaluationResult evaluate(EvaluationSample<T> sample, T output);
}
Strategies return an EvaluationResult which can be created using:
- EvaluationResult.passed() / EvaluationResult.failed(): Simple pass/fail
- EvaluationResult.fromBoolean(boolean): Convert a boolean to a result
- EvaluationResult.withScore(double): Include a numerical score (0.0-1.0)
- Fluent API for complex results with explanations and metadata:
var result = EvaluationResult
.passed(0.95)
.withExplanation("Response semantically matches expected output")
.withMetadata(Map.of("similarityScore", 0.95, "method", "cosine"));
Evaluation Report
The EvaluationReport aggregates the results of all evaluations.
It provides:
- A global score (percentage of passed evaluations, 0.0-100.0)
- Scores per tag (for categorized evaluations)
- Access to individual evaluation results with scores, explanations, and metadata
- Report generation in multiple formats (Markdown, JSON, or custom formats)
Fluent Assertions for Reports
The framework provides EvaluationAssertions for readable test assertions:
import static io.quarkiverse.langchain4j.testing.evaluation.EvaluationAssertions.assertThat;
assertThat(report)
.hasScore(80.0) // Exact score
.hasScoreGreaterThan(70.0) // Minimum score
.hasScoreBetween(70.0, 90.0) // Score range
.hasPassedCount(8) // Exact passed count
.hasAtLeastPassedEvaluations(4) // Minimum passed
.hasFailedCount(2) // Exact failed count
.hasAtMostFailedEvaluations(3) // Maximum failed
.hasEvaluationCount(10); // Total count
Writing Evaluation Tests
You can write evaluation tests in multiple ways:
- programmatic: using a Scorer object directly, injected in a field or as a method parameter
- declarative: using the @EvaluationTest annotation
Example Test Using Field Injection (programmatic)
@ExtendWith(EvaluationExtension.class)
public class ScorerFieldInjectionTest {
@ScorerConfiguration(concurrency = 4)
private Scorer scorer;
@Test
void evaluateSamples() {
// Define test samples
Samples<String> samples = new Samples<>(
EvaluationSample.<String>builder().withName("Sample1").withParameter("p1").withExpectedOutput("expected1").build(),
EvaluationSample.<String>builder().withName("Sample2").withParameter("p2").withExpectedOutput("expected2").build()
);
// Define evaluation strategies
EvaluationStrategy<String> strategy = new SemanticSimilarityStrategy(0.85);
// Evaluate samples
EvaluationReport report = scorer.evaluate(samples, parameters -> {
// Replace with your function under test
return "actualOutput";
}, strategy);
// Assert results
assertThat(report.score()).isGreaterThan(50.0);
}
}
Example Test Using Parameter Injection (programmatic)
@ExtendWith(EvaluationExtension.class)
public class ScorerParameterInjectionTest {
// ....
@Test
void evaluateWithInjectedScorer(
@ScorerConfiguration(concurrency = 2) Scorer scorer,
@SampleLocation("test-samples.yaml") Samples<String> samples
) {
// Use an evaluation strategy
EvaluationStrategy<String> strategy = new AiJudgeStrategy(myChatLanguageModel);
// Evaluate samples
EvaluationReport report = scorer.evaluate(samples, parameters -> {
// Replace with your function under test
return "actualOutput";
}, strategy);
// Assert results
assertThat(report).hasScoreGreaterThan(50.0);
}
}
Declarative Testing with @EvaluationTest
The @EvaluationTest annotation provides a declarative way to define evaluation tests without writing explicit test methods:
@QuarkusTest
@Evaluate
@DisplayNameGeneration(EvaluationDisplayNameGenerator.class)
public class DeclarativeEvaluationTest {
@Inject
CustomerSupportBot bot;
/**
* Define a reusable evaluation function.
* This function will be referenced by name in the test annotations.
*/
@EvaluationFunction("chatbot")
public Function<Parameters, String> chatbotFunction() {
return params -> bot.chat(params.get(0));
}
/**
* Declarative test using @EvaluationTest.
* The framework automatically loads samples, evaluates them,
* and asserts the minimum score.
*/
@EvaluationTest(
samples = "smoke-tests.yaml",
strategy = SemanticSimilarityStrategy.class,
function = "chatbot",
minScore = 70.0
)
void smokeTestsWithSemanticSimilarity() {
// Test body can be empty - evaluation happens automatically
// The test will fail if score is below 70%
}
/**
* Another @EvaluationTest with a different configuration.
*/
@EvaluationTest(
samples = "customer-support-samples.yaml",
strategy = AiJudgeStrategy.class,
function = "chatbot",
minScore = 85.0
)
void criticalCustomerSupportEvaluation() {
// Higher threshold for critical evaluations
}
}
When using @EvaluationTest, the framework automatically:
- Loads samples from the specified location
- Injects the evaluation function by name
- Applies the specified evaluation strategy (the instance is created using a no-arg constructor)
- Evaluates the samples and generates a report
Test Templates with @StrategyTest
The @StrategyTest annotation enables testing the same function against multiple evaluation strategies:
@QuarkusTest
@Evaluate
@DisplayNameGeneration(EvaluationDisplayNameGenerator.class)
public class DeclarativeEvaluationTest {
@Inject
CustomerSupportBot bot;
/**
* Test using multiple strategies with @StrategyTest.
* The test runs once for each strategy.
*/
@StrategyTest(strategies = {
SemanticSimilarityStrategy.class,
AiJudgeStrategy.class,
})
void customerSupportWithMultipleStrategies(
@SampleLocation("src/test/resources/smoke-tests.yaml") Samples<String> samples,
EvaluationStrategy<String> strategy,
Scorer scorer) {
// This test method will execute twice:
// 1. Once with SemanticSimilarityStrategy
// 2. Once with AiJudgeStrategy
// Each execution appears as a separate test in the results
var report = scorer.evaluate(
samples,
params -> bot.chat(params.get(0)),
strategy);
System.out.printf("Strategy %s - Score: %.2f%%%n",
strategy.getClass().getSimpleName(),
report.score());
assertThat(report).hasScoreGreaterThan(60.0);
}
}
The customerSupportWithMultipleStrategies method is invoked once for each specified strategy, allowing you to compare performance across different evaluation approaches.
Loading Samples from Multiple Sources
When you need to combine samples from multiple files or sources, use the @SampleSources annotation. This is useful for:
- Combining different test suites (e.g., smoke tests + regression tests)
- Organizing samples by category or feature in separate files
- Reusing common sample sets across different tests
Basic Usage
The @SampleSources annotation accepts an array of @SampleLocation annotations and merges all samples into a single Samples instance:
@QuarkusTest
@Evaluate
public class CombinedEvaluationTest {
@Inject
CustomerSupportBot bot;
@Test
void evaluateWithMultipleSampleSources(
@SampleSources({
@SampleLocation("src/test/resources/smoke-tests.yaml"),
@SampleLocation("src/test/resources/regression-tests.yaml"),
@SampleLocation("src/test/resources/edge-cases.yaml")
}) Samples<String> samples,
@ScorerConfiguration(concurrency = 3) Scorer scorer) {
// All samples from all three files are combined
EvaluationReport<String> report = scorer.evaluate(
samples,
params -> bot.chat(params.get(0)),
new SemanticSimilarityStrategy(0.85));
assertThat(report)
.hasScoreGreaterThan(80.0)
.hasAtLeastPassedEvaluations(4);
}
}
Sample Order Preservation
Samples from multiple sources are combined in the order they appear in the @SampleSources array. Within each source, the original order is preserved:
@Test
void samplesAreOrderedCorrectly(
@SampleSources({
@SampleLocation("src/test/resources/smoke-tests.yaml"), // First
@SampleLocation("src/test/resources/regression-tests.yaml") // Second
}) Samples<String> samples) {
var sampleNames = samples.stream()
.map(sample -> sample.name())
.toList();
// Samples from smoke-tests.yaml come first, then regression-tests.yaml
assertThat(sampleNames).containsExactly(
"Smoke Test 1",
"Smoke Test 2",
"Regression Test 1",
"Regression Test 2"
);
}
Filtering Combined Samples by Tags
You can filter combined samples by tags, which is useful when you want to run specific subsets of your combined sample set:
@Test
void evaluateOnlyCriticalSamples(
@SampleSources({
@SampleLocation("src/test/resources/smoke-tests.yaml"),
@SampleLocation("src/test/resources/regression-tests.yaml")
}) Samples<String> samples,
Scorer scorer) {
// Filter to only evaluate samples tagged as "critical"
var criticalSamples = samples.filterByTags("critical");
var report = scorer.evaluate(
criticalSamples,
params -> bot.chat(params.get(0)),
new SemanticSimilarityStrategy(0.90));
assertThat(report)
.hasPassedForTag("critical")
.hasScoreGreaterThan(95.0);
}
Use with @StrategyTest
@SampleSources can be combined with @StrategyTest to test multiple sample sources against multiple strategies:
@StrategyTest(strategies = {
SemanticSimilarityStrategy.class,
AiJudgeStrategy.class
})
void evaluateMultipleSourcesWithMultipleStrategies(
@SampleSources({
@SampleLocation("src/test/resources/smoke-tests.yaml"),
@SampleLocation("src/test/resources/regression-tests.yaml")
}) Samples<String> samples,
EvaluationStrategy<String> strategy,
Scorer scorer) {
// This test runs 2 times (once per strategy)
// Each execution uses all combined samples
var report = scorer.evaluate(
samples,
params -> bot.chat(params.get(0)),
strategy);
assertThat(report).hasScoreGreaterThan(70.0);
}
All samples from all specified sources are merged into a single Samples instance, so filtering, evaluation, and reporting work seamlessly across the combined set.
Readable Test Names with EvaluationDisplayNameGenerator
You can produce readable test display names showing sample names, scores, and pass/fail status:
@QuarkusTest
@Evaluate
@DisplayNameGeneration(EvaluationDisplayNameGenerator.class)
public class DeclarativeEvaluationTest {
@Inject
CustomerSupportBot bot;
@EvaluationFunction("chatbot")
public Function<Parameters, String> chatbotFunction() {
return params -> bot.chat(params.get(0));
}
@EvaluationTest(
samples = "smoke-tests.yaml",
strategy = SemanticSimilarityStrategy.class,
function = "chatbot",
minScore = 70.0
)
void smokeTestsWithSemanticSimilarity() {
// Test names will show sample results with scores
}
}
Test Suite Reporting
Generate aggregate reports across all tests using @ReportConfiguration:
@QuarkusTest
@Evaluate
@ExtendWith(SuiteEvaluationReporter.class) // Enable test suite reporting
public class SuiteReportingTest {
@ReportConfiguration(
outputDir = "target",
fileName = "test-suite-report",
formats = {"markdown", "json"},
includeDetails = true
)
private EvaluationReport<String> smokeTestReport;
@Test
void smokeTests(@ScorerConfiguration(concurrency = 5) Scorer scorer,
@SampleLocation("smoke-tests.yaml") Samples<String> samples) {
smokeTestReport = scorer.evaluate(samples,
params -> myService.process(params.get(0)),
new SemanticSimilarityStrategy(0.8));
// Register report for test suite aggregation
ReportRegistry.registerReport(
getClass().getSimpleName(),
"smokeTestReport",
smokeTestReport);
}
}
This generates:
- target/evaluation-suite-report.md: Aggregated test suite report (Markdown)
- target/evaluation-suite-report.json: Aggregated test suite report (JSON)
Built-in Evaluation Strategies
Semantic Similarity
The SemanticSimilarityStrategy (io.quarkiverse.langchain4j.testing.evaluation.similarity.SemanticSimilarityStrategy) evaluates the similarity between the actual output and the expected output using cosine similarity.
It requires an embedding model and a minimum similarity threshold.
Maven Dependency:
<dependency>
<groupId>io.quarkiverse.langchain4j</groupId>
<artifactId>quarkus-langchain4j-testing-evaluation-semantic-similarity</artifactId>
<scope>test</scope>
</dependency>
Examples:
EvaluationStrategy<String> strategy = new SemanticSimilarityStrategy(0.9);
EvaluationStrategy<String> strategy2 = new SemanticSimilarityStrategy(embeddingModel, 0.85);
AI Judge
The AiJudgeStrategy (io.quarkiverse.langchain4j.testing.evaluation.judge.AiJudgeStrategy) leverages an AI model to determine if the actual output matches the expected output.
It uses a configurable evaluation prompt and ChatModel.
Maven Dependency:
<dependency>
<groupId>io.quarkiverse.langchain4j</groupId>
<artifactId>quarkus-langchain4j-testing-evaluation-ai-judge</artifactId>
<scope>test</scope>
</dependency>
Example:
EvaluationStrategy<String> strategy = new AiJudgeStrategy(myChatLanguageModel, """
You are an AI evaluating a response and the expected output.
You need to evaluate whether the model response is correct or not.
Return true if the response is correct, false otherwise.
Response to evaluate: {response}
Expected output: {expected_output}
""");
Creating a Custom Evaluation Strategy
To implement your own evaluation strategy, implement the EvaluationStrategy interface:
import io.quarkiverse.langchain4j.testing.evaluation.*;
public class MyCustomStrategy implements EvaluationStrategy<String> {
@Override
public EvaluationResult evaluate(EvaluationSample<String> sample, String output) {
// Custom evaluation logic
boolean matches = output.equalsIgnoreCase(sample.expectedOutput());
return EvaluationResult.fromBoolean(matches);
}
}
Then, use the custom strategy in your test:
EvaluationStrategy<String> strategy = new MyCustomStrategy();
EvaluationReport report = scorer.evaluate(samples, parameters -> {
return "actualOutput";
}, strategy);
Here is an example of a custom strategy that can be used to verify the correctness of a vector search:
public class TextSegmentEvaluationStrategy implements EvaluationStrategy<List<String>> {
@Override
public EvaluationResult evaluate(EvaluationSample<List<String>> sample, List<String> response) {
List<String> expected = sample.expectedOutput();
int found = 0;
for (String seg : expected) {
// Make sure that the response contains the expected segment
boolean segFound = false;
for (String s : response) {
if (s.toLowerCase().contains(seg.toLowerCase())) {
segFound = true;
found++;
break;
}
}
if (!segFound) {
System.out.println("Segment not found: " + seg);
}
}
double score = (double) found / expected.size();
return EvaluationResult.builder()
.passed(found == expected.size())
.score(score)
.explanation(String.format("Found %d of %d expected segments", found, expected.size()))
.build();
}
}
Injecting Samples
You can load samples directly from a YAML file using the @SampleLocation annotation:
- name: Sample1
  parameters:
    - "value1"
  expected-output: "expected1"
  tags: ["tag1"]
- name: Sample2
  parameters:
    - "value2"
  expected-output: "expected2"
  tags: ["tag2"]
Then, inject the samples into your test method:
@Test
void evaluateWithSamples(@SampleLocation("test-samples.yaml") Samples<String> samples) {
// Use samples in your test
}
Custom Sample Loaders
The framework supports custom sample loaders through a hybrid discovery mechanism using both Java ServiceLoader (SPI) and CDI.
Implementing a Custom Sample Loader
Create a class implementing the SampleLoader interface:
package com.example;
import io.quarkiverse.langchain4j.testing.evaluation.*;
public class JsonSampleLoader implements SampleLoader {
@Override
public boolean supports(String source) {
return source.endsWith(".json");
}
@Override
public <T> Samples<T> load(String source, Class<T> outputType) {
// Load and parse the JSON file, build EvaluationSample instances,
// and return them wrapped in a Samples<T>
throw new UnsupportedOperationException("JSON parsing omitted for brevity");
}
}
Registering via ServiceLoader (SPI)
Create META-INF/services/io.quarkiverse.langchain4j.testing.evaluation.SampleLoader:
com.example.JsonSampleLoader
Registering via CDI
Make your loader a CDI bean:
@ApplicationScoped
public class JsonSampleLoader implements SampleLoader {
// Implementation
}
The framework automatically discovers loaders from both sources, with CDI beans taking precedence. CDI is the recommended approach for better integration with Quarkus.
Samples can also be loaded from a remote location (an HTTP URL, a database, etc.) by implementing the corresponding logic in the load method, as sketched below.
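The following hypothetical loader is a minimal sketch of such a remote loader: it fetches samples over HTTP and assumes a simple tab-separated line format (name, parameter, expected output). It only relies on the SampleLoader, EvaluationSample, and Samples APIs shown earlier on this page; the class name and the line format are illustrative.
package com.example;

import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.List;

import io.quarkiverse.langchain4j.testing.evaluation.*;

// Hypothetical loader fetching samples over HTTP.
// Assumed line format: name<TAB>parameter<TAB>expected output
public class HttpTsvSampleLoader implements SampleLoader {

    @Override
    public boolean supports(String source) {
        return source.startsWith("http://") || source.startsWith("https://");
    }

    @Override
    @SuppressWarnings("unchecked")
    public <T> Samples<T> load(String source, Class<T> outputType) {
        if (!String.class.equals(outputType)) {
            throw new IllegalArgumentException("This loader only supports String expected outputs");
        }
        try {
            // Fetch the remote sample file as plain text
            HttpResponse<String> response = HttpClient.newHttpClient().send(
                    HttpRequest.newBuilder(URI.create(source)).GET().build(),
                    HttpResponse.BodyHandlers.ofString());
            // Build one EvaluationSample per non-blank line
            List<EvaluationSample<String>> loaded = response.body().lines()
                    .filter(line -> !line.isBlank())
                    .map(line -> line.split("\t", 3))
                    .map(cols -> EvaluationSample.<String>builder()
                            .withName(cols[0])
                            .withParameter(cols[1])
                            .withExpectedOutput(cols[2])
                            .build())
                    .toList();
            return (Samples<T>) new Samples<>(loaded);
        } catch (IOException | InterruptedException e) {
            throw new IllegalStateException("Unable to load samples from " + source, e);
        }
    }
}
Like any other loader, it can be registered either through the ServiceLoader file or as a CDI bean, as described above.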
Custom Report Formatters
Create custom report formatters to export evaluation results in different formats.
Implementing a Custom Report Formatter
package com.example;
import io.quarkiverse.langchain4j.testing.evaluation.*;
import java.io.Writer;
import java.util.Map;
public class HtmlReportFormatter implements ReportFormatter {
@Override
public String format() {
return "html";
}
@Override
public String fileExtension() {
return ".html";
}
@Override
public void format(EvaluationReport<?> report, Writer writer, Map<String, Object> config) {
// Generate HTML report
writer.write("<html><body>");
writer.write("<h1>Evaluation Report</h1>");
writer.write("<p>Score: " + report.score() + "%</p>");
// ... more HTML generation
writer.write("</body></html>");
}
}
Registering via ServiceLoader (SPI)
Create META-INF/services/io.quarkiverse.langchain4j.testing.evaluation.ReportFormatter:
com.example.HtmlReportFormatter
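Once registered, the custom format can be requested by name wherever report formats are configured. For example, assuming the formats attribute of @ReportConfiguration is matched against the value returned by ReportFormatter.format(), the HTML formatter above could be selected like this:
@ReportConfiguration(
        outputDir = "target",
        fileName = "evaluation-report",
        formats = {"markdown", "html"}, // "html" is assumed to resolve to com.example.HtmlReportFormatter
        includeDetails = true)
private EvaluationReport<String> report;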
Example of tests using Quarkus
Let’s imagine an AI Service used by a chatbot to generate responses. Let’s also imagine that this AI Service has access to a RAG content retriever. The associated tests could be:
package dev.langchain4j.quarkus;
import dev.langchain4j.data.message.UserMessage;
import dev.langchain4j.data.segment.TextSegment;
import dev.langchain4j.model.openai.OpenAiChatModel;
import dev.langchain4j.quarkus.workshop.CustomerSupportAssistant;
import dev.langchain4j.rag.AugmentationRequest;
import dev.langchain4j.rag.RetrievalAugmentor;
import dev.langchain4j.rag.content.Content;
import dev.langchain4j.rag.query.Metadata;
import io.quarkiverse.langchain4j.evaluation.junit5.Evaluate;
import io.quarkiverse.langchain4j.evaluation.junit5.SampleLocation;
import io.quarkiverse.langchain4j.evaluation.junit5.ScorerConfiguration;
import io.quarkiverse.langchain4j.testing.evaluation.EvaluationReport;
import io.quarkiverse.langchain4j.testing.evaluation.EvaluationResult;
import io.quarkiverse.langchain4j.testing.evaluation.EvaluationSample;
import io.quarkiverse.langchain4j.testing.evaluation.EvaluationStrategy;
import io.quarkiverse.langchain4j.testing.evaluation.Parameters;
import io.quarkiverse.langchain4j.testing.evaluation.Samples;
import io.quarkiverse.langchain4j.testing.evaluation.Scorer;
import io.quarkiverse.langchain4j.testing.evaluation.judge.AiJudgeStrategy;
import io.quarkiverse.langchain4j.testing.evaluation.similarity.SemanticSimilarityStrategy;
import io.quarkus.test.junit.QuarkusTest;
import jakarta.enterprise.context.control.ActivateRequestContext;
import jakarta.inject.Inject;
import jakarta.inject.Singleton;
import org.junit.jupiter.api.Test;
import java.util.List;
import java.util.UUID;
import java.util.function.Function;
import static org.assertj.core.api.Assertions.assertThat;
@QuarkusTest
@Evaluate
public class AssistantTest {
// Just a function calling the AI Service and returning the response as a String.
@Inject
AiServiceEvaluation aiServiceEvaluation;
// The content retriever from the RAG pattern I want to test
@Inject
RetrievalAugmentor retriever;
// Test the AI Service using the Semantic Similarity Strategy
@Test
void testAiService(@ScorerConfiguration(concurrency = 5) Scorer scorer,
@SampleLocation("src/test/resources/samples.yaml") Samples<String> samples) {
EvaluationReport report = scorer.evaluate(samples, aiServiceEvaluation,
new SemanticSimilarityStrategy(0.8));
assertThat(report.score()).isGreaterThanOrEqualTo(70);
}
// Test the AI Service using the AI Judge Strategy
@Test
void testAiServiceUsingAiJudge(Scorer scorer,
@SampleLocation("src/test/resources/samples.yaml") Samples<String> samples) {
var judge = OpenAiChatModel.builder()
.baseUrl("http://localhost:11434/v1") // Ollama
.modelName("mistral")
.build();
EvaluationReport report = scorer.evaluate(samples, aiServiceEvaluation,
new AiJudgeStrategy(judge));
assertThat(report.score()).isGreaterThanOrEqualTo(70);
}
// Evaluation strategies can be CDI beans (which means they can easily be injected)
@Inject
TextSegmentEvaluationStrategy textSegmentEvaluationStrategy;
// Test of the RAG retriever
@Test
void testRagRetriever(Scorer scorer, @SampleLocation("src/test/resources/content-retriever-samples.yaml") Samples<List<String>> samples) {
EvaluationReport report = scorer.evaluate(samples, i -> runRetriever(i.get(0)),
textSegmentEvaluationStrategy);
assertThat(report.score()).isEqualTo(100); // Expect full success
}
private List<String> runRetriever(String query) {
UserMessage message = UserMessage.userMessage(query);
AugmentationRequest request = new AugmentationRequest(message,
new Metadata(message, UUID.randomUUID().toString(), List.of()));
var res = retriever.augment(request);
return res.contents().stream().map(Content::textSegment).map(TextSegment::text).toList();
}
@Singleton
public static class AiServiceEvaluation implements Function<Parameters, String> {
@Inject
CustomerSupportAssistant assistant;
@ActivateRequestContext
@Override
public String apply(Parameters params) {
return assistant.chat(UUID.randomUUID().toString(), params.get(0)).collect()
.in(StringBuilder::new, StringBuilder::append).map(StringBuilder::toString).await().indefinitely();
}
}
@Singleton
public static class TextSegmentEvaluationStrategy implements EvaluationStrategy<List<String>> {
@Override
public EvaluationResult evaluate(EvaluationSample<List<String>> sample, List<String> response) {
List<String> expected = sample.expectedOutput();
int found = 0;
for (String seg : expected) {
// Make sure that the response contains the expected segment
boolean segFound = false;
for (String s : response) {
if (s.toLowerCase().contains(seg.toLowerCase())) {
segFound = true;
found++;
break;
}
}
if (!segFound) {
System.out.println("Segment not found: " + seg);
}
}
return EvaluationResult.fromBoolean(found == expected.size());
}
}
}
This test class demonstrates how to use the EvaluationExtension to evaluate an AI Service and a RAG retriever using different strategies.
The associated samples are:
---
- name: "car types"
parameters:
- "What types of cars do you offer for rental?"
expected-output: |
We offer three categories of cars:
1. Compact Commuter – Ideal for city driving, fuel-efficient, and budget-friendly. Example: Toyota Corolla, Honda Civic.
2. Family Explorer SUV – Perfect for family trips with spacious seating for up to 7 passengers. Example: Toyota RAV4, Hyundai Santa Fe.
3. Luxury Cruiser – Designed for traveling in style with premium features. Example: Mercedes-Benz E-Class, BMW 5 Series.
- name: "cancellation"
parameters:
- "Can I cancel my car rental booking at any time?"
expected-output: |
Our cancellation policy states that reservations can be canceled up to 11 days prior to the start of the booking period. If the booking period is less than 4 days, cancellations are not permitted.
- name: "teaching"
parameters:
- "Am I allowed to use the rental car to teach someone how to drive?"
expected-output: |
No, rental cars from Miles of Smiles cannot be used for teaching someone to drive, as outlined in our Terms of Use under “Use of Vehicle.”
- name: "damages"
parameters:
- "What happens if the car is damaged during my rental period?"
expected-output: |
You will be held liable for any damage, loss, or theft that occurs during the rental period, as stated in our Terms of Use under “Liability.”
- name: "requirements"
parameters:
- "What are the requirements for making a car rental booking?"
expected-output: |
To make a booking, you need to provide accurate, current, and complete information during the reservation process. All bookings are also subject to vehicle availability.
- name: "race"
parameters:
- "Can I use the rental car for a race or rally?"
expected-output: |
No, rental cars must not be used for any race, rally, or contest. This is prohibited as per our Terms of Use under “Use of Vehicle.”
- name: "family"
parameters:
- "Do you offer cars suitable for long family trips?"
expected-output: |
Yes, we recommend the Family Explorer SUV for long family trips. It offers spacious seating for up to seven passengers, ample cargo space, and advanced driver-assistance features.
- name: "alcohol"
parameters:
- "Is there any restriction on alcohol consumption while using the rental car?"
expected-output: |
Yes, you are not allowed to drive the rental car while under the influence of alcohol or drugs. This is strictly prohibited as stated in our Terms of Use.
- name: "other questions"
parameters:
- What should I do if I have questions unrelated to car rentals?
expected-output: |
For questions unrelated to car rentals, I recommend contacting the appropriate department. I’m here to assist with any car rental-related inquiries!
- name: "categories"
parameters:
- "Which car category is best for someone who values luxury and comfort?"
expected-output: |
If you value luxury and comfort, the Luxury Cruiser is the perfect choice. It offers premium interiors, cutting-edge technology, and unmatched comfort for a first-class driving experience.
and for the content retriever:
---
- name: cancellation_policy_test
  parameters:
    - What is the cancellation policy for car rentals?
  expected-outputs:
    - "Reservations can be cancelled up to 11 days prior to the start of the booking period."
    - "If the booking period is less than 4 days, cancellations are not permitted."
- name: vehicle_restrictions_test
  parameters:
    - What are the restrictions on how the rental car can be used?
  expected-outputs:
    - "All cars rented from Miles of Smiles must not be used:"
    - "for any illegal purpose or in connection with any criminal offense."
    - "for teaching someone to drive."
    - "in any race, rally or contest."
    - "while under the influence of alcohol or drugs."
- name: car_types_test
  parameters:
    - What types of cars are available for rent?
  expected-outputs:
    - "Compact Commuter"
    - "Perfect for city driving and short commutes, this fuel-efficient and easy-to-park car is your ideal companion for urban adventures"
    - "Family Explorer SUV"
    - "Designed for road trips, family vacations, or adventures with friends, this spacious and versatile SUV offers ample cargo space, comfortable seating for up to seven passengers"
    - "Luxury Cruiser"
    - "For those who want to travel in style, the Luxury Cruiser delivers unmatched comfort, cutting-edge technology, and a touch of elegance"
- name: car_damage_liability_test
  parameters:
    - What happens if I damage the car during my rental period?
  expected-outputs:
    - "Users will be held liable for any damage, loss, or theft that occurs during the rental period"
- name: governing_law_test
  parameters:
    - Under what law are the terms and conditions governed?
  expected-outputs:
    - "These terms will be governed by and construed in accordance with the laws of the United States of America"
    - "Any disputes relating to these terms will be subject to the exclusive jurisdiction of the courts of United States"