Testing AI-Infused Applications
The quarkus-langchain4j-testing-evaluation-junit5 extension provides a pragmatic and extensible testing framework for evaluating AI-infused and agentic applications.
It integrates with JUnit 5 and @QuarkusTest, and offers tools for automating evaluation processes, scoring outputs, and generating evaluation reports using customizable evaluation strategies.
You can use this testing framework:
-
using the API directly - to programmatically define and run evaluations in your tests or even applications
-
using the JUnit 5 extension - to declaratively define evaluation tests with annotations
Features
The evaluation framework provides:
Core Capabilities:
-
Concurrent Evaluation - Execute evaluations in parallel with configurable concurrency
-
Fluent Builder API - Readable, chainable evaluation definitions (samples, evaluation strategy, reporting)
-
Rich Evaluation Results - Scores, explanations, and metadata for detailed analysis
-
Multiple Evaluation Strategies - Semantic similarity, AI judge, and custom strategies
-
Flexible Sample Loading - YAML, JSON, or custom formats via SPI or CDI
Testing Approaches:
-
Programmatic Testing - Traditional programmatic test definitions with Scorer
-
Declarative Testing - @EvaluationTest for annotation-driven test configuration
-
Test Templates - @StrategyTest to run tests against multiple strategies
-
Multiple Sample Sources - @SampleSources to combine samples from different files
Assertions and Reporting:
-
Fluent Assertions - EvaluationAssertions for readable test verification
-
Readable Test Names - EvaluationDisplayNameGenerator shows scores and pass/fail status
-
Multi-Format Reports - Generate Markdown, JSON, or custom format reports
Extensibility:
-
Custom Strategies - Implement EvaluationStrategy for domain-specific evaluation logic
-
Custom Sample Loaders - Support any sample format via the SampleLoader interface
-
Custom Report Formatters - Generate reports in any format via the ReportFormatter interface
Maven Dependency
To use the EvaluationExtension, include the following Maven dependency in your pom.xml:
<dependency>
<groupId>io.quarkiverse.langchain4j</groupId>
<artifactId>quarkus-langchain4j-testing-evaluation-junit5</artifactId>
<scope>test</scope>
</dependency>
Prerequisites
| For the evaluation framework to work correctly with AI Services, you MUST configure the Maven compiler to preserve parameter names. |
Add this to your pom.xml:
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-compiler-plugin</artifactId>
<version>3.13.0</version>
<configuration>
<parameters>true</parameters> (1)
</configuration>
</plugin>
| 1 | REQUIRED for @RegisterAiService when used with the evaluation framework |
Without this configuration, you may encounter errors like:
java.lang.IllegalStateException: Duplicate key null
(attempted merging values 0 and 1)
This occurs because the Java compiler strips parameter names by default, and the LangChain4j framework needs them to map method parameters to template variables in @SystemMessage and @UserMessage annotations.
Using the extension
To use the extension, annotate your test class with @ExtendWith(EvaluationExtension.class) or @Evaluate:
import org.junit.jupiter.api.Test;
import org.junit.jupiter.api.extension.ExtendWith;
import io.quarkiverse.langchain4j.evaluation.junit5.EvaluationExtension;
@ExtendWith(EvaluationExtension.class)
public class MyScorerTests {
// Test cases go here
}
Or, you can use the @Evaluate annotation:
import org.junit.jupiter.api.Test;
import org.junit.jupiter.api.extension.ExtendWith;
import io.quarkiverse.langchain4j.evaluation.junit5.Evaluate;
@Evaluate
public class MyScorerTests {
// Test cases go here
}
This JUnit 5 extension can be combined with @QuarkusTest to test Quarkus applications:
import io.quarkus.test.junit.QuarkusTest;
import org.junit.jupiter.api.Test;
import org.junit.jupiter.api.extension.ExtendWith;
import io.quarkiverse.langchain4j.evaluation.junit5.Evaluate;
@QuarkusTest
@Evaluate
public class MyScorerTests {
// Test cases go here
}
Concepts
Evaluation
An Evaluation is the process of testing whether an AI system’s output meets expectations for a given input. Unlike traditional unit tests that check for exact matches, evaluations assess AI outputs using flexible criteria such as semantic similarity, correctness, or custom domain-specific logic.
In the context of Quarkus Langchain4J, an evaluation consists of:
-
A Sample - The test case containing input parameters and expected output (and optional metadata like tags)
-
A Function Under Test - The AI service, model, or function being evaluated; it receives the input parameters and produces the actual output
-
An Evaluation Strategy - The logic that determines if the actual output is acceptable based on the expected output
-
An Evaluation Result - The outcome containing pass/fail status, optional score, explanation, and metadata
The evaluation process works as follows: the function under test is invoked with the sample's parameters, and the evaluation strategy compares the actual output with the sample's expected output to produce a result.
For example, testing a customer support chatbot:
-
Sample: Input = "What are your business hours?", Expected = "We’re open Monday-Friday, 9am-5pm"
-
Function: The chatbot’s response generation (e.g., chatbot.chat("What are your business hours?"))
-
Strategy: Semantic similarity (≥80% match), AI judge, or custom logic
-
Result: Passed (score=0.92, explanation="Response conveys the same business hours information")
This approach recognizes that AI outputs are often non-deterministic and may vary in phrasing while remaining correct. It also provides flexibility to define evaluation criteria that align with application requirements.
Scorer
The Scorer (io.quarkiverse.langchain4j.testing.evaluation.Scorer) is the cornerstone component that orchestrates evaluations for a set of samples (represented by io.quarkiverse.langchain4j.testing.evaluation.Samples) against a function (part of the application) and evaluation strategies.
It can run evaluations concurrently and produces an EvaluationReport summarizing the results and providing an aggregate score.
The produced score is the percentage of passed evaluations (between 0.0 and 100.0), calculated as the number of passed evaluations divided by the total number of evaluations. For example, 8 passed evaluations out of 10 yield a score of 80.0.
In general, tests using the Scorer follow this pattern:
// The AI Service to test, or whatever entry point of your application
@Inject CustomerSupportAssistant assistant;
@Test
void testAiService(
// The scorer instance, with concurrency set to 5
@ScorerConfiguration(concurrency = 5) Scorer scorer,
// The samples loaded from a YAML file, or custom loader
@SampleLocation("src/test/resources/samples.yaml") Samples<String> samples) {
// Define the function that will be evaluated
// The parameters come from the sample
// The output of this function will be compared to the expected output in the samples
Function<Parameters, String> function = parameters -> assistant.chat(parameters.get(0));
EvaluationReport report = scorer.evaluate(samples, function,
new SemanticSimilarityStrategy(0.8)); // The evaluation strategy
assertThat(report.score()).isGreaterThanOrEqualTo(70); // Assert the score
}
Fluent Builder API
The framework provides a fluent builder API through the Evaluation class for more readable evaluation definitions:
import static io.quarkiverse.langchain4j.testing.evaluation.EvaluationAssertions.assertThat;
@Test
void evaluateWithFluentBuilder() {
EvaluationReport<String> report = Evaluation.<String>builder()
.withSamples("src/test/resources/customer-support-samples.yaml")
.evaluate(params -> bot.chat(params.get(0)))
.using(new SemanticSimilarityStrategy(0.80))
.run();
assertThat(report)
.hasScoreGreaterThan(75.0)
.hasAtLeastPassedEvaluations(4);
}
When using this API, the scorer is created automatically.
The fluent API also supports filtering by tags:
@Test
void evaluateFilteredByTags() {
// Evaluate only samples with specific tags
EvaluationReport<String> report = Evaluation.<String>builder()
.withSamples("src/test/resources/customer-support-samples.yaml")
.withTags("critical") // Only evaluate critical samples
.evaluate(params -> bot.chat(params.get(0)))
.using(new SemanticSimilarityStrategy(0.75))
.run();
assertThat(report)
.hasScoreGreaterThan(70.0)
.hasAtLeastPassedEvaluations(2);
}
You can apply multiple strategies to the same samples:
@Test
void evaluateWithMultipleStrategies() {
// Apply multiple strategies to the same samples
EvaluationReport<String> report = Evaluation.<String>builder()
.withSamples("src/test/resources/smoke-tests.yaml")
.withConcurrency(1)
.evaluate(params -> bot.chat(params.get(0)))
.using(new SemanticSimilarityStrategy(0.85))
.using(new SemanticSimilarityStrategy(0.75))
.run();
assertThat(report).hasScoreGreaterThan(60.0);
}
The fluent API supports deferred execution, allowing you to build evaluation definitions and execute them later:
@Test
void deferredExecution() {
// Build the evaluation definition
EvaluationRunner<String> runner = Evaluation.<String>builder()
.withSamples("src/test/resources/smoke-tests.yaml")
.evaluate(params -> bot.chat(params.get(0)))
.using(new SemanticSimilarityStrategy(0.80))
.build();
// Execute later (can be called multiple times)
EvaluationReport<String> report1 = runner.run();
EvaluationReport<String> report2 = runner.run();
assertThat(report1).hasScoreGreaterThan(60.0);
assertThat(report2).hasScoreGreaterThan(60.0);
}
When multiple strategies are added, each sample is evaluated against all strategies, and results are aggregated in the report. For example, 3 samples evaluated with 2 strategies produce 6 evaluation results, and the aggregate score is the percentage of those 6 that passed.
Samples
A Sample (io.quarkiverse.langchain4j.testing.evaluation.EvaluationSample) represents a single input-output test case.
It includes:
-
a name: the name of the sample,
-
the parameters: the parameter data for the test (input),
-
the expected output: the expected result that will be evaluated,
-
the tags: metadata that can categorize the sample for targeted evaluation (tags are optional).
When tags are set, the score can be calculated per tag (in addition to the global score).
A list of samples is represented by Samples (io.quarkiverse.langchain4j.testing.evaluation.Samples).
Samples can be defined using a builder pattern:
var s1 = EvaluationSample.<String>builder()
.withName("sample1")
.withParameter("value1")
.withExpectedOutput("my expected result2")
.build();
var s2 = EvaluationSample.<String>builder()
.withName("sample2")
.withParameter("value2")
.withExpectedOutput("my expected results")
.build();
Samples<String> samples = new Samples<>(List.of(s1, s2));
Alternatively, samples can be loaded from a YAML file using the @SampleLocation annotation:
- name: Sample1
parameters:
- "parameter1"
expected-output: "expected1"
tags: ["tag1"]
- name: Sample2
parameters:
- "parameter2"
expected-output: "expected2"
tags: ["tag1"]
YAML Sample Format Reference
The YAML format MUST follow this exact structure:
- name: unique_sample_identifier # Required: unique name
parameters: # Required: MUST be plural "parameters"
- "first parameter value"
- "second parameter value" # Optional: multiple parameters
expected-output: | # Required: hyphenated, can be multiline
The expected output content here
Can span multiple lines
tags: ["tag1", "tag2"] # Optional: for filtering
For samples with multiple expected outputs (e.g., testing retrievers), use expected-outputs (plural):
- name: retriever_test_sample
parameters:
- "search query"
expected-outputs: # Note: plural form
- "expected segment 1"
- "expected segment 2"
- "expected segment 3"
tags: ["retriever"]
| Common format requirements: |
-
Use parameters (plural), not parameter or input
-
Use expected-output (hyphenated) for a single String output, or expected-outputs (plural) for a List<String> output
-
You MUST use either expected-output or expected-outputs, not both
-
Parameters must be a list, even for single values
-
The file can optionally start with --- as a YAML document separator
-
Multiline expected output uses the | (pipe) character for literal block style
| You can implement your own loader to support custom sample formats (see the Custom Sample Loaders section). |
Evaluation Strategy
An EvaluationStrategy (io.quarkiverse.langchain4j.testing.evaluation.EvaluationStrategy) defines how to evaluate a sample.
The framework includes ready-to-use strategies (detailed below), and you can implement custom ones.
/**
* A strategy to evaluate the output of a model.
* @param <T> the type of the output.
*/
public interface EvaluationStrategy<T> {
/**
* Evaluate the output of a model.
* @param sample the sample to evaluate.
* @param output the output of the model.
* @return an EvaluationResult with pass/fail status, optional score, explanation, and metadata.
*/
EvaluationResult evaluate(EvaluationSample<T> sample, T output);
}
Strategies return an EvaluationResult which can be created using:
-
EvaluationResult.passed(double score): Create a passing result with a score (0.0-1.0)
-
EvaluationResult.failed(double score, String explanation): Create a failing result with a score and explanation
-
EvaluationResult.failed(String explanation): Create a failing result with an explanation (score defaults to 0.0)
-
EvaluationResult.fromBoolean(boolean): Convert a boolean to a result (passed=1.0, failed=0.0)
-
Fluent API for adding explanations and metadata:
var result = EvaluationResult
.passed(0.95)
.withExplanation("Response semantically matches expected output")
.withMetadata(Map.of("similarityScore", 0.95, "method", "cosine"));
Evaluation Report
The EvaluationReport aggregates the results of all evaluations.
It provides:
-
A global score (percentage of passed evaluations, 0.0-100.0)
-
Scores per tag (for categorized evaluations)
-
Access to individual evaluation results with scores, explanations, and metadata
-
Report generation in multiple formats (Markdown, JSON, or custom formats)
Fluent Assertions for Reports
The framework provides EvaluationAssertions for readable test assertions:
import static io.quarkiverse.langchain4j.testing.evaluation.EvaluationAssertions.assertThat;
assertThat(report)
.hasScore(80.0) // Exact score
.hasScoreGreaterThan(70.0) // Minimum score
.hasScoreBetween(70.0, 90.0) // Score range
.hasPassedCount(8) // Exact passed count
.hasAtLeastPassedEvaluations(4) // Minimum passed
.hasFailedCount(2) // Exact failed count
.hasAtMostFailedEvaluations(3) // Maximum failed
.hasEvaluationCount(10); // Total count
Writing Evaluation Tests
You can write evaluation tests in multiple ways:
-
programmatic, using a Scorer object directly injected in a field or as a method parameter
-
declarative, using the @EvaluationTest annotation
Example Test Using Field Injection (programmatic)
@ExtendWith(EvaluationExtension.class)
public class ScorerFieldInjectionTest {
@ScorerConfiguration(concurrency = 4)
private Scorer scorer;
@Test
void evaluateSamples() {
// Define test samples
Samples<String> samples = new Samples<>(
EvaluationSample.<String>builder().withName("Sample1").withParameter("p1").withExpectedOutput("expected1").build(),
EvaluationSample.<String>builder().withName("Sample2").withParameter("p2").withExpectedOutput("expected2").build()
);
// Define evaluation strategies
EvaluationStrategy<String> strategy = new SemanticSimilarityStrategy(0.85);
// Evaluate samples
EvaluationReport report = scorer.evaluate(samples, parameters -> {
// Replace with your function under test
return "actualOutput";
}, strategy);
// Assert results
assertThat(report.score()).isGreaterThan(50.0);
}
}
Example Test Using Parameter Injection (programmatic)
@ExtendWith(EvaluationExtension.class)
public class ScorerParameterInjectionTest {
// ....
@Test
void evaluateWithInjectedScorer(
@ScorerConfiguration(concurrency = 2) Scorer scorer,
@SampleLocation("test-samples.yaml") Samples<String> samples
) {
// Use an evaluation strategy
EvaluationStrategy<String> strategy = new AiJudgeStrategy(myChatLanguageModel);
// Evaluate samples
EvaluationReport report = scorer.evaluate(samples, parameters -> {
// Replace with your function under test
return "actualOutput";
}, strategy);
// Assert results
assertThat(report).hasScoreGreaterThan(50.0);
}
}
Declarative Testing with @EvaluationTest
The @EvaluationTest annotation provides a declarative way to define evaluation tests without writing explicit test methods:
@QuarkusTest
@Evaluate
@DisplayNameGeneration(EvaluationDisplayNameGenerator.class)
public class DeclarativeEvaluationTest {
@Inject
CustomerSupportBot bot;
/**
* Define a reusable evaluation function.
* This function will be referenced by name in the test annotations.
*/
@EvaluationFunction("chatbot")
public Function<Parameters, String> chatbotFunction() {
return params -> bot.chat(params.get(0));
}
/**
* Declarative test using @EvaluationTest.
* The framework automatically loads samples, evaluates them,
* and asserts the minimum score.
*/
@EvaluationTest(
samples = "smoke-tests.yaml",
strategy = SemanticSimilarityStrategy.class,
function = "chatbot",
minScore = 70.0
)
void smokeTestsWithSemanticSimilarity() {
// Test body can be empty - evaluation happens automatically
// The test will fail if score is below 70%
}
/**
* Another @EvaluationTest with a different configuration.
*/
@EvaluationTest(
samples = "customer-support-samples.yaml",
strategy = AiJudgeStrategy.class,
function = "chatbot",
minScore = 85.0
)
void criticalCustomerSupportEvaluation() {
// Higher threshold for critical evaluations
}
}
When using @EvaluationTest, the framework automatically:
-
Loads samples from the specified location
-
Injects the evaluation function by name
-
Applies the specified evaluation strategy (the instance is created using a no-arg constructor)
-
Evaluates the samples and generates a report
Test Templates with @StrategyTest
The @StrategyTest annotation enables testing the same function against multiple evaluation strategies:
@QuarkusTest
@Evaluate
@DisplayNameGeneration(EvaluationDisplayNameGenerator.class)
public class DeclarativeEvaluationTest {
@Inject
CustomerSupportBot bot;
/**
* Test using multiple strategies with @StrategyTest.
* The test runs once for each strategy.
*/
@StrategyTest(strategies = {
SemanticSimilarityStrategy.class,
AiJudgeStrategy.class,
})
void customerSupportWithMultipleStrategies(
@SampleLocation("src/test/resources/smoke-tests.yaml") Samples<String> samples,
EvaluationStrategy<String> strategy,
Scorer scorer) {
// This test method will execute twice:
// 1. Once with SemanticSimilarityStrategy
// 2. Once with AiJudgeStrategy
// Each execution appears as a separate test in the results
var report = scorer.evaluate(
samples,
params -> bot.chat(params.get(0)),
strategy);
System.out.printf("Strategy %s - Score: %.2f%%%n",
strategy.getClass().getSimpleName(),
report.score());
assertThat(report).hasScoreGreaterThan(60.0);
}
}
The customerSupportWithMultipleStrategies method is invoked once for each specified strategy, allowing you to compare performance across different evaluation approaches.
Loading Samples from Multiple Sources
When you need to combine samples from multiple files or sources, use the @SampleSources annotation. This is useful for:
-
Combining different test suites (e.g., smoke tests + regression tests)
-
Organizing samples by category or feature in separate files
-
Reusing common sample sets across different tests
Basic Usage
The @SampleSources annotation accepts an array of @SampleLocation annotations and merges all samples into a single Samples instance:
@QuarkusTest
@Evaluate
public class CombinedEvaluationTest {
@Inject
CustomerSupportBot bot;
@Test
void evaluateWithMultipleSampleSources(
@SampleSources({
@SampleLocation("src/test/resources/smoke-tests.yaml"),
@SampleLocation("src/test/resources/regression-tests.yaml"),
@SampleLocation("src/test/resources/edge-cases.yaml")
}) Samples<String> samples,
@ScorerConfiguration(concurrency = 3) Scorer scorer) {
// All samples from all three files are combined
EvaluationReport<String> report = scorer.evaluate(
samples,
params -> bot.chat(params.get(0)),
new SemanticSimilarityStrategy(0.85));
assertThat(report)
.hasScoreGreaterThan(80.0)
.hasAtLeastPassedEvaluations(4);
}
}
Sample Order Preservation
Samples from multiple sources are combined in the order they appear in the @SampleSources array. Within each source, the original order is preserved:
@Test
void samplesAreOrderedCorrectly(
@SampleSources({
@SampleLocation("src/test/resources/smoke-tests.yaml"), // First
@SampleLocation("src/test/resources/regression-tests.yaml") // Second
}) Samples<String> samples) {
var sampleNames = samples.stream()
.map(sample -> sample.name())
.toList();
// Samples from smoke-tests.yaml come first, then regression-tests.yaml
assertThat(sampleNames).containsExactly(
"Smoke Test 1",
"Smoke Test 2",
"Regression Test 1",
"Regression Test 2"
);
}
Filtering Combined Samples by Tags
You can filter combined samples by tags, which is useful when you want to run specific subsets of your combined sample set:
@Test
void evaluateOnlyCriticalSamples(
@SampleSources({
@SampleLocation("src/test/resources/smoke-tests.yaml"),
@SampleLocation("src/test/resources/regression-tests.yaml")
}) Samples<String> samples,
Scorer scorer) {
// Filter to only evaluate samples tagged as "critical"
var criticalSamples = samples.filterByTags("critical");
var report = scorer.evaluate(
criticalSamples,
params -> bot.chat(params.get(0)),
new SemanticSimilarityStrategy(0.90));
assertThat(report)
.hasPassedForTag("critical")
.hasScoreGreaterThan(95.0);
}
Use with @StrategyTest
@SampleSources can be combined with @StrategyTest to test multiple sample sources against multiple strategies:
@StrategyTest(strategies = {
SemanticSimilarityStrategy.class,
AiJudgeStrategy.class
})
void evaluateMultipleSourcesWithMultipleStrategies(
@SampleSources({
@SampleLocation("src/test/resources/smoke-tests.yaml"),
@SampleLocation("src/test/resources/regression-tests.yaml")
}) Samples<String> samples,
EvaluationStrategy<String> strategy,
Scorer scorer) {
// This test runs 2 times (once per strategy)
// Each execution uses all combined samples
var report = scorer.evaluate(
samples,
params -> bot.chat(params.get(0)),
strategy);
assertThat(report).hasScoreGreaterThan(70.0);
}
All samples from all specified sources are merged into a single Samples instance, so filtering, evaluation, and reporting work seamlessly across the combined set.
Readable Test Names with EvaluationDisplayNameGenerator
You can produce readable test display names showing sample names, scores, and pass/fail status:
@QuarkusTest
@Evaluate
@DisplayNameGeneration(EvaluationDisplayNameGenerator.class)
public class DeclarativeEvaluationTest {
@Inject
CustomerSupportBot bot;
@EvaluationFunction("chatbot")
public Function<Parameters, String> chatbotFunction() {
return params -> bot.chat(params.get(0));
}
@EvaluationTest(
samples = "smoke-tests.yaml",
strategy = SemanticSimilarityStrategy.class,
function = "chatbot",
minScore = 70.0
)
void smokeTestsWithSemanticSimilarity() {
// Test names will show sample results with scores
}
}
Test Suite Reporting
Generate aggregate reports across all tests using @ReportConfiguration:
@QuarkusTest
@Evaluate
@ExtendWith(SuiteEvaluationReporter.class) // Enable test suite reporting
public class SuiteReportingTest {
@ReportConfiguration(
outputDir = "target",
fileName = "test-suite-report",
formats = {"markdown", "json"},
includeDetails = true
)
private EvaluationReport<String> smokeTestReport;
@Test
void smokeTests(@ScorerConfiguration(concurrency = 5) Scorer scorer,
@SampleLocation("smoke-tests.yaml") Samples<String> samples) {
smokeTestReport = scorer.evaluate(samples,
params -> myService.process(params.get(0)),
new SemanticSimilarityStrategy(0.8));
// Register report for test suite aggregation
ReportRegistry.registerReport(
getClass().getSimpleName(),
"smokeTestReport",
smokeTestReport);
}
}
This generates:
-
target/evaluation-suite-report.md: Aggregated test suite report (Markdown)
-
target/evaluation-suite-report.json: Aggregated test suite report (JSON)
Built-in Evaluation Strategies
Semantic Similarity
The SemanticSimilarityStrategy (io.quarkiverse.langchain4j.testing.evaluation.similarity.SemanticSimilarityStrategy) evaluates the similarity between the actual output and the expected output using cosine similarity.
It requires an embedding model and a minimum similarity threshold.
Maven Dependency:
<dependency>
<groupId>io.quarkiverse.langchain4j</groupId>
<artifactId>quarkus-langchain4j-testing-evaluation-semantic-similarity</artifactId>
<scope>test</scope>
</dependency>
Examples:
EvaluationStrategy<String> strategy = new SemanticSimilarityStrategy(0.9);
EvaluationStrategy<String> strategy2 = new SemanticSimilarityStrategy(embeddingModel, 0.85);
AI Judge
The AiJudgeStrategy leverages an AI model to determine if the actual output matches the expected output.
It uses a configurable evaluation prompt and a ChatModel; a constructor taking only the ChatModel uses a default evaluation prompt (as shown in the full example at the end of this page).
Maven Dependency
<dependency>
<groupId>io.quarkiverse.langchain4j</groupId>
<artifactId>quarkus-langchain4j-testing-evaluation-ai-judge</artifactId>
<scope>test</scope>
</dependency>
Basic Example:
EvaluationStrategy<String> strategy = new AiJudgeStrategy(myChatLanguageModel, """
You are an AI evaluating a response and the expected output.
You need to evaluate whether the model response is correct or not.
Return true if the response is correct, false otherwise.
Response to evaluate: {response}
Expected output: {expected_output}
""");
Advanced Usage: Structured JSON Output
For reliable parsing and detailed evaluation results, request JSON output in your judge prompt:
@ApplicationScoped
@RegisterAiService
public interface LlmJudgeService {
@SystemMessage("""
You are an expert evaluator for an AI application.
Your role is to judge whether an actual response correctly
addresses what was expected based on the user's question.
Evaluation Criteria:
- Does the response address the user's question?
- Is the information accurate and relevant?
- Does it cover key concepts from the expected output?
- Is it helpful and informative?
Scoring Guidelines:
- 100: Perfect response - fully addresses the question
- 80-99: Good response - addresses well, minor gaps
- 60-79: Adequate response - addresses but missing info
- 40-59: Partial response - only partially addresses
- 20-39: Poor response - barely relevant or mostly incorrect
- 0-19: Incorrect response - completely wrong or off-topic
Respond with ONLY a JSON object in this exact format:
{
"score": <number 0-100>,
"passed": <true or false>,
"explanation": "<brief explanation of the score>"
}
""")
@UserMessage("""
Question: {{question}}
Expected Behavior: {{expectedOutput}}
Actual Response: {{actualResponse}}
Evaluate the actual response and provide your judgment as JSON.
""")
String judgeResponse(String question, String expectedOutput, String actualResponse);
}
Robust JSON Parsing
LLMs may wrap JSON in markdown code blocks. Handle this robustly:
private JudgmentResult parseJudgment(String response) throws Exception {
String cleaned = response.trim();
// Remove markdown code blocks
if (cleaned.startsWith("```json")) {
cleaned = cleaned.substring(7);
}
if (cleaned.startsWith("```")) {
cleaned = cleaned.substring(3);
}
if (cleaned.endsWith("```")) {
cleaned = cleaned.substring(0, cleaned.length() - 3);
}
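// Assumes a Jackson ObjectMapper field in the enclosing class, e.g.:
// private final ObjectMapper objectMapper = new ObjectMapper();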
// Parse JSON
JsonNode root = objectMapper.readTree(cleaned.trim());
int score = root.get("score").asInt();
boolean passed = root.get("passed").asBoolean();
String explanation = root.get("explanation").asText();
return new JudgmentResult(score, passed, explanation);
}
private record JudgmentResult(int score, boolean passed, String explanation) {}
Fallback Strategy
Always provide a fallback for when the LLM judge fails:
public class LlmJudgeEvaluationStrategy implements EvaluationStrategy<String> {
private final LlmJudgeService judgeService;
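// Assumed for this sketch: a logger (e.g. org.jboss.logging.Logger) and a constructor, for example:
// private static final Logger LOG = Logger.getLogger(LlmJudgeEvaluationStrategy.class);
// public LlmJudgeEvaluationStrategy(LlmJudgeService judgeService) { this.judgeService = judgeService; }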
@Override
public EvaluationResult evaluate(EvaluationSample<String> sample, String output) {
try {
return evaluateWithLlmJudge(sample, output);
} catch (Exception e) {
LOG.warn("LLM judge failed for sample: " + sample.name(), e);
return fallbackEvaluation(sample, output);
}
}
private EvaluationResult fallbackEvaluation(
EvaluationSample<String> sample, String output) {
// Simple keyword-based fallback
boolean contains = output.toLowerCase()
.contains(sample.expectedOutput().toLowerCase());
return EvaluationResult.fromBoolean(contains)
.withExplanation("Fallback evaluation (LLM judge unavailable)");
}
}
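The evaluateWithLlmJudge method referenced above is not shown. Here is a minimal sketch that ties together the LlmJudgeService and parseJudgment helpers from the previous snippets; it assumes the sample exposes its input through a parameters() accessor (an assumption, adjust to the actual API):
private EvaluationResult evaluateWithLlmJudge(EvaluationSample<String> sample, String output) throws Exception {
    // Assumption: the first sample parameter is the user question
    String question = sample.parameters().get(0);
    // Ask the LLM judge and parse its JSON verdict
    String rawJudgment = judgeService.judgeResponse(question, sample.expectedOutput(), output);
    JudgmentResult judgment = parseJudgment(rawJudgment);
    // Map the 0-100 judge score onto the 0.0-1.0 result scale
    double score = judgment.score() / 100.0;
    return judgment.passed()
            ? EvaluationResult.passed(score).withExplanation(judgment.explanation())
            : EvaluationResult.failed(score, judgment.explanation());
}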
Creating a Custom Evaluation Strategy
To implement your own evaluation strategy, implement the EvaluationStrategy interface:
import io.quarkiverse.langchain4j.testing.evaluation.*;
public class MyCustomStrategy implements EvaluationStrategy<String> {
@Override
public EvaluationResult evaluate(EvaluationSample<String> sample, String output) {
// Custom evaluation logic
boolean matches = output.equalsIgnoreCase(sample.expectedOutput());
return EvaluationResult.fromBoolean(matches);
}
}
Then, use the custom strategy in your test:
EvaluationStrategy<String> strategy = new MyCustomStrategy();
EvaluationReport report = scorer.evaluate(samples, parameters -> {
return "actualOutput";
}, strategy);
Here is an example of a custom strategy that can be used to verify the correctness of a vector search:
public class TextSegmentEvaluationStrategy implements EvaluationStrategy<List<String>> {
@Override
public EvaluationResult evaluate(EvaluationSample<List<String>> sample, List<String> response) {
List<String> expected = sample.expectedOutput();
int found = 0;
for (String seg : expected) {
// Make sure that the response contains the expected segment
boolean segFound = false;
for (String s : response) {
if (s.toLowerCase().contains(seg.toLowerCase())) {
segFound = true;
found++;
break;
}
}
if (!segFound) {
System.out.println("Segment not found: " + seg);
}
}
double score = (double) found / expected.size();
boolean passed = found == expected.size();
if (passed) {
return EvaluationResult.passed(score)
.withExplanation(String.format("Found %d of %d expected segments", found, expected.size()));
} else {
return EvaluationResult.failed(score,
String.format("Found %d of %d expected segments", found, expected.size()));
}
}
}
Injecting Samples
You can load samples directly from a YAML file using the @SampleLocation annotation:
- name: Sample1
parameters:
- "value1"
expected-output: "expected1"
tags: ["tag1"]
- name: Sample2
parameters:
- "value2"
expected-output: "expected2"
tags: ["tag2"]
Then, inject the samples into your test method:
@Test
void evaluateWithSamples(@SampleLocation("test-samples.yaml") Samples<String> samples) {
// Use samples in your test
}
@SampleLocation Path Resolution
The @SampleLocation annotation expects filesystem paths relative to the project root, NOT classpath resource paths.
@SampleLocation("src/test/resources/samples.yaml") // Relative to project root
@SampleLocation("test-data/samples.yaml") // Also relative to project root
@SampleLocation("/absolute/path/to/samples.yaml") // Absolute path (not recommended)
| Always use the full path from the project root for clarity and reliability. |
Custom Sample Loaders
The framework supports custom sample loaders through a hybrid discovery mechanism using both Java ServiceLoader (SPI) and CDI.
Implementing a Custom Sample Loader
Create a class implementing the SampleLoader interface:
package com.example;
import io.quarkiverse.langchain4j.testing.evaluation.*;
public class JsonSampleLoader implements SampleLoader {
@Override
public boolean supports(String source) {
return source.endsWith(".json");
}
@Override
public <T> Samples<T> load(String source, Class<T> outputType) {
// Load and parse JSON file
// Return Samples<T>
}
}
Registering via ServiceLoader (SPI)
Create META-INF/services/io.quarkiverse.langchain4j.testing.evaluation.SampleLoader:
com.example.JsonSampleLoader
Registering via CDI
Make your loader a CDI bean:
@ApplicationScoped
public class JsonSampleLoader implements SampleLoader {
// Implementation
}
The framework automatically discovers loaders from both sources, with CDI beans taking precedence. CDI is the recommended approach for better integration with Quarkus.
Samples can also be loaded from a remote location (HTTP URL, database, etc.) by implementing the custom logic in the load method, as sketched below.
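For example, here is a minimal sketch of a loader that fetches String samples from an HTTP endpoint returning a JSON array of objects with name, parameter, and expected-output fields. The class name and endpoint format are illustrative, and the sketch only supports String outputs:
package com.example;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.ArrayList;
import java.util.List;
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import io.quarkiverse.langchain4j.testing.evaluation.EvaluationSample;
import io.quarkiverse.langchain4j.testing.evaluation.SampleLoader;
import io.quarkiverse.langchain4j.testing.evaluation.Samples;
import jakarta.enterprise.context.ApplicationScoped;
@ApplicationScoped
public class HttpSampleLoader implements SampleLoader {
    private final HttpClient client = HttpClient.newHttpClient();
    private final ObjectMapper mapper = new ObjectMapper();
    @Override
    public boolean supports(String source) {
        // Only handle http(s) sources; file-based loaders keep handling the rest
        return source.startsWith("http://") || source.startsWith("https://");
    }
    @Override
    @SuppressWarnings("unchecked")
    public <T> Samples<T> load(String source, Class<T> outputType) {
        if (!String.class.equals(outputType)) {
            throw new IllegalArgumentException("This loader only supports String outputs");
        }
        try {
            HttpResponse<String> response = client.send(
                    HttpRequest.newBuilder(URI.create(source)).GET().build(),
                    HttpResponse.BodyHandlers.ofString());
            JsonNode array = mapper.readTree(response.body());
            List<EvaluationSample<String>> list = new ArrayList<>();
            for (JsonNode node : array) {
                list.add(EvaluationSample.<String>builder()
                        .withName(node.get("name").asText())
                        .withParameter(node.get("parameter").asText())
                        .withExpectedOutput(node.get("expected-output").asText())
                        .build());
            }
            Samples<String> samples = new Samples<>(list);
            return (Samples<T>) samples;
        } catch (Exception e) {
            throw new IllegalStateException("Unable to load samples from " + source, e);
        }
    }
}
Once discovered, such a loader could be used by pointing @SampleLocation (or the fluent builder's withSamples) at the remote URL, assuming the location string is passed to the loaders unchanged.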
Custom Report Formatters
Create custom report formatters to export evaluation results in different formats.
Implementing a Custom Report Formatter
package com.example;
import io.quarkiverse.langchain4j.testing.evaluation.*;
import java.io.Writer;
import java.util.Map;
public class HtmlReportFormatter implements ReportFormatter {
@Override
public String format() {
return "html";
}
@Override
public String fileExtension() {
return ".html";
}
@Override
public void format(EvaluationReport<?> report, Writer writer, Map<String, Object> config) {
// Generate HTML report
writer.write("<html><body>");
writer.write("<h1>Evaluation Report</h1>");
writer.write("<p>Score: " + report.score() + "%</p>");
// ... more HTML generation
writer.write("</body></html>");
}
}
Registering via ServiceLoader (SPI)
Create META-INF/services/io.quarkiverse.langchain4j.testing.evaluation.ReportFormatter:
com.example.HtmlReportFormatter
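Assuming the formats attribute of @ReportConfiguration is matched against ReportFormatter.format() names, the custom formatter could then be requested alongside the built-in ones (field and file names are illustrative):
@ReportConfiguration(
        outputDir = "target",
        fileName = "evaluation-report",
        formats = {"markdown", "html"}, // "html" would resolve to HtmlReportFormatter
        includeDetails = true
)
private EvaluationReport<String> report;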
Testing Guardrails
Guardrails are critical for production AI applications. Test them thoroughly for accuracy.
Unit Testing InputGuardrails
@QuarkusTest
@Evaluate
public class GuardrailEvaluationTest {
@Inject
OffTopicDetectionGuardrail guardrail;
@Test
public void testGuardrailAccuracy(
@ScorerConfiguration(concurrency = 2) Scorer scorer,
@SampleLocation("src/test/resources/guardrail-samples.yaml") Samples<String> samples) {
EvaluationReport<String> report = scorer.evaluate(
samples,
params -> evaluateGuardrail(params.get(0)),
new GuardrailEvaluationStrategy()
);
// Expect perfect accuracy for guardrails
assertThat(report).hasScore(100.0);
}
private String evaluateGuardrail(String input) {
try {
UserMessage message = UserMessage.from(input);
guardrail.validate(message);
return "ACCEPT";
} catch (InputGuardrailException e) {
return "REJECT:" + e.getMessage();
}
}
}
Guardrail Evaluation Strategy
public class GuardrailEvaluationStrategy implements EvaluationStrategy<String> {
@Override
public EvaluationResult evaluate(EvaluationSample<String> sample, String output) {
String expectedBehavior = sample.expectedOutput();
// Check if output matches expected behavior
if (expectedBehavior.startsWith("ACCEPT")) {
if (output.equals("ACCEPT")) {
return EvaluationResult.passed(1.0)
.withExplanation("Correctly accepted on-topic question");
} else {
return EvaluationResult.failed("False negative: incorrectly rejected on-topic question");
}
} else if (expectedBehavior.startsWith("REJECT")) {
if (output.startsWith("REJECT")) {
return EvaluationResult.passed(1.0)
.withExplanation("Correctly rejected off-topic question");
} else {
return EvaluationResult.failed("False positive: incorrectly accepted off-topic question");
}
}
return EvaluationResult.failed("Unknown expected behavior: " + expectedBehavior);
}
}
Sample Format for Guardrails
- name: on_topic_ai_services
parameters:
- "How do I use @RegisterAiService?"
expected-output: "ACCEPT"
tags: ["on-topic", "ai-services"]
- name: on_topic_rag
parameters:
- "How does RAG work in Quarkus LangChain4j?"
expected-output: "ACCEPT"
tags: ["on-topic", "rag"]
- name: off_topic_weather
parameters:
- "What's the weather today?"
expected-output: "REJECT"
tags: ["off-topic", "weather"]
- name: off_topic_cooking
parameters:
- "How do I bake a cake?"
expected-output: "REJECT"
tags: ["off-topic", "cooking"]
Metrics to Track
For production guardrails, track these metrics:
-
Accuracy: Overall percentage of correct classifications
-
False Positive Rate: Incorrectly accepted off-topic questions
-
False Negative Rate: Incorrectly rejected on-topic questions
-
Precision: Of accepted questions, how many were truly on-topic
-
Recall: Of all on-topic questions, how many were accepted
Target: 100% accuracy for guardrails, 0% false negatives (security critical)
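With the existing API, these rates can be approximated by evaluating the tagged subsets separately. A minimal sketch, reusing the evaluateGuardrail helper and GuardrailEvaluationStrategy shown above and assuming samples are tagged "on-topic" and "off-topic" as in the YAML example:
@Test
void measureGuardrailErrorRates(
        @ScorerConfiguration(concurrency = 2) Scorer scorer,
        @SampleLocation("src/test/resources/guardrail-samples.yaml") Samples<String> samples) {
    // Failed "on-topic" samples are false negatives, failed "off-topic" samples are false positives
    EvaluationReport<String> onTopicReport = scorer.evaluate(
            samples.filterByTags("on-topic"),
            params -> evaluateGuardrail(params.get(0)),
            new GuardrailEvaluationStrategy());
    EvaluationReport<String> offTopicReport = scorer.evaluate(
            samples.filterByTags("off-topic"),
            params -> evaluateGuardrail(params.get(0)),
            new GuardrailEvaluationStrategy());
    // 100% on both subsets means 0% false negatives and 0% false positives
    assertThat(onTopicReport).hasScore(100.0);
    assertThat(offTopicReport).hasScore(100.0);
}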
Testing REST APIs
You can combine RestAssured with the evaluation framework to test HTTP endpoints:
@QuarkusTest
@Evaluate
public class RestApiEvaluationTest {
@Inject
LlmJudgeService judgeService;
@Test
public void testRestApi(
@ScorerConfiguration(concurrency = 2) Scorer scorer,
@SampleLocation("src/test/resources/api-samples.yaml") Samples<String> samples) {
EvaluationReport<String> report = scorer.evaluate(
samples,
params -> callApi(params.get(0)),
new LlmJudgeEvaluationStrategy(judgeService)
);
assertThat(report).hasScoreGreaterThan(80.0);
}
private String callApi(String input) {
try {
Response response = RestAssured
.given()
.contentType(ContentType.JSON)
.body(Map.of("message", input, "memoryId", UUID.randomUUID().toString()))
.when()
.post("/api/chat")
.then()
.extract().response();
int statusCode = response.getStatusCode();
String body = response.body().asString();
// Format for evaluation: encode both status and response
return String.format("STATUS:%d|RESPONSE:%s", statusCode, body);
} catch (Exception e) {
return String.format("STATUS:500|ERROR:%s", e.getMessage());
}
}
}
Your evaluation strategy can then parse this format:
public class RestApiEvaluationStrategy implements EvaluationStrategy<String> {
@Override
public EvaluationResult evaluate(EvaluationSample<String> sample, String output) {
// Parse: "STATUS:200|RESPONSE:..."
String[] parts = output.split("\\|", 2);
int statusCode = Integer.parseInt(parts[0].replace("STATUS:", ""));
String response = parts[1].substring(parts[1].indexOf(":") + 1);
// Validate based on expected behavior
if (statusCode >= 200 && statusCode < 300) {
return evaluateSuccessResponse(sample, response);
} else {
return evaluateErrorResponse(sample, statusCode, response);
}
}
}
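The evaluateSuccessResponse and evaluateErrorResponse helpers referenced above are not defined. A minimal sketch, living in the same class, under the assumption that a successful call only needs a non-empty body and that an error status is acceptable only when the sample's expected output mentions it:
private EvaluationResult evaluateSuccessResponse(EvaluationSample<String> sample, String response) {
    // Minimal check: a successful call must return a non-empty body;
    // delegate deeper content checks to an LLM judge or semantic similarity if needed
    if (response == null || response.isBlank()) {
        return EvaluationResult.failed("Successful status but empty response body");
    }
    return EvaluationResult.passed(1.0)
            .withExplanation("Non-empty response body for a successful call");
}
private EvaluationResult evaluateErrorResponse(EvaluationSample<String> sample, int statusCode, String response) {
    // An error status is only acceptable when the sample expects it (e.g. "Should return HTTP 400 ...")
    boolean errorExpected = sample.expectedOutput().contains(String.valueOf(statusCode));
    return EvaluationResult.fromBoolean(errorExpected)
            .withExplanation("Status " + statusCode + (errorExpected ? " was expected" : " was not expected"));
}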
Sample format for API tests:
- name: successful_query
parameters:
- "How do I use Quarkus LangChain4j?"
expected-output: |
Should explain basic setup with dependency and @RegisterAiService.
tags: ["success", "on-topic"]
- name: off_topic_rejection
parameters:
- "What's the weather today?"
expected-output: |
Should be rejected with appropriate message.
tags: ["rejection", "off-topic"]
- name: empty_input_validation
parameters:
- ""
expected-output: |
Should return HTTP 400 with validation error.
tags: ["validation", "edge-case"]
Multi-Layered Testing Strategy
Combine fast component-level checks with progressively more thorough integration, end-to-end, and API-level evaluations:
// Level 1: Component Test
@Test
void testRetrieverComponent(Scorer scorer, Samples<String> samples) {
EvaluationReport<String> report = scorer.evaluate(
samples,
params -> retriever.retrieve(params.get(0)),
new KeywordRelevanceStrategy() // Fast, simple
);
assertThat(report).hasScoreGreaterThan(70.0);
}
// Level 2: Integration Test
@Test
void testAiServiceWithRetriever(Scorer scorer, Samples<String> samples) {
EvaluationReport<String> report = scorer.evaluate(
samples,
params -> aiService.chat(params.get(0)),
new SemanticSimilarityStrategy(0.80) // More sophisticated
);
assertThat(report).hasScoreGreaterThan(80.0);
}
// Level 3: End-to-End Test
@Test
void testCompleteChatFlow(Scorer scorer, Samples<String> samples) {
EvaluationReport<String> report = scorer.evaluate(
samples,
params -> chatService.processMessage(params.get(0)),
new SemanticSimilarityStrategy(0.85)
);
assertThat(report).hasScoreGreaterThan(85.0);
}
// Level 4: REST API Test
@Test
void testRestApiEndpoint(Scorer scorer, Samples<String> samples) {
EvaluationReport<String> report = scorer.evaluate(
samples,
params -> callHttpApi(params.get(0)),
new LlmJudgeEvaluationStrategy(judge) // Most thorough
);
assertThat(report).hasScoreGreaterThan(80.0);
}
Testing Edge Cases
Real-world AI applications must handle edge cases gracefully. Include these in your evaluation tests.
Common Edge Cases to Test
# Empty input
- name: empty_message
parameters:
- ""
expected-output: |
Should return validation error or prompt user for input.
tags: ["edge-case", "validation"]
# Very long input
- name: very_long_question
parameters:
- "Can you explain Quarkus LangChain4j in detail including all features like AI services, RAG, evaluation, guardrails, tools, embeddings, chat memory, and how they all work together with examples and best practices for each feature?"
expected-output: |
Should handle long input gracefully without truncation errors.
tags: ["edge-case", "long-input"]
# Special characters
- name: special_characters
parameters:
- "How do I use @RegisterAiService & <configure> embeddings? (with examples!)"
expected-output: |
Should handle special characters correctly in parsing and response.
tags: ["edge-case", "special-chars"]
# Multilingual input
- name: multilingual_question
parameters:
- "Comment utiliser Quarkus LangChain4j?" # French: How to use
expected-output: |
Should recognize as on-topic and attempt to answer.
tags: ["edge-case", "multilingual"]
# Ambiguous follow-up
- name: follow_up_without_context
parameters:
- "Can you explain that in more detail?"
expected-output: |
Should ask for clarification or context.
tags: ["edge-case", "ambiguous"]
# SQL injection attempt
- name: sql_injection_attempt
parameters:
- "'; DROP TABLE users; --"
expected-output: |
Should handle safely without execution or error.
tags: ["edge-case", "security"]
# Extremely off-topic
- name: completely_unrelated
parameters:
- "What is the meaning of life?"
expected-output: "REJECT"
tags: ["edge-case", "off-topic"]
# Mixed language
- name: code_in_question
parameters:
- "Why does `@Inject ChatModel model;` not work?"
expected-output: |
Should parse code blocks and provide relevant answer.
tags: ["edge-case", "code-snippet"]
Cost Considerations
Evaluation tests make API calls which incur costs. Estimate costs before running large test suites.
Per-Sample Cost Estimation
Typical costs per sample (using OpenAI):
RAG Retrieval:
- Embeddings (text-embedding-3-small): ~$0.001
- 5 retrievals: ~$0.005
AI Response Generation:
- Chat completion (gpt-4o-mini): ~$0.01-0.05
- Depends on context size and response length
LLM as Judge:
- Evaluation (gpt-4o-mini): ~$0.001-0.003
- Input: question + expected + actual (~300-600 tokens)
- Output: JSON judgment (~100-150 tokens)
Guardrail Classification:
- Topic detection (gpt-4o-mini): ~$0.0001-0.001
- Very short prompts and responses
Example: 20-sample test suite
- On-topic: ~$0.015 × 14 = $0.21
- Off-topic: ~$0.001 × 5 = $0.005
- Edge cases: ~$0.01 × 1 = $0.01
- Total: ~$0.22 per test run
Cost Optimization Strategies
-
Use Smaller Models for Evaluation
// Use gpt-4o-mini for judges (10x cheaper than gpt-4)
@RegisterAiService(chatModelName = "gpt-4o-mini")
public interface LlmJudgeService { ... }
-
Reuse Vector Stores
-
Filter by Tags
// Run only critical tests during development
Samples<String> criticalOnly = samples.filterByTags("critical");
Troubleshooting
"Duplicate key null" Error
Full Error:
java.lang.IllegalStateException: Duplicate key null
(attempted merging values 0 and 1)
at io.quarkiverse.langchain4j.deployment.AiServicesProcessor...
Cause: Maven compiler not preserving parameter names for @RegisterAiService methods
Solution: Add to pom.xml:
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-compiler-plugin</artifactId>
<configuration>
<parameters>true</parameters>
</configuration>
</plugin>
Then run: mvn clean compile test-compile
"File not found" for Sample Loading
Error: File not found: evaluation-samples/samples.yaml
Cause: @SampleLocation expects filesystem paths, not classpath paths
Solution: Use full path from project root:
@SampleLocation("src/test/resources/evaluation-samples/samples.yaml")
YAML Parsing Errors
Error: Cannot parse YAML sample file
Common causes:
-
Using expected instead of expected-output
-
Using parameter instead of parameters (must be plural)
-
Not using list format for parameters
Solution: Follow exact YAML format:
- name: sample1
parameters: # Plural, must be list
- "value"
expected-output: | # Hyphenated
Expected result
Low Evaluation Scores
Symptom: Tests consistently scoring lower than expected
Possible causes:
-
Limited RAG Content - e.g., only 3 documentation pages ingested; questions about topics not in those pages will fail. Solution: ingest more relevant documentation.
-
Threshold Too High - AI responses are non-deterministic, so 100% semantic similarity is unrealistic. Solution: use 70-85% thresholds for semantic similarity.
-
Wrong Evaluation Strategy - keyword matching is too strict for AI responses. Solution: use SemanticSimilarityStrategy or an AI judge instead.
-
Poor Sample Quality - expected outputs are too specific or too generic. Solution: review and improve sample quality.
Example adjustments:
// Too strict - may fail
assertThat(report).hasScoreGreaterThan(95.0);
// Reasonable for AI systems
assertThat(report).hasScoreGreaterThan(75.0);
Example of tests using Quarkus
Let’s imagine an AI Service used by a chatbot to generate responses, and that this AI Service has access to a RAG content retriever. The associated tests could be:
package dev.langchain4j.quarkus;
import dev.langchain4j.data.message.UserMessage;
import dev.langchain4j.data.segment.TextSegment;
import dev.langchain4j.model.openai.OpenAiChatModel;
import dev.langchain4j.quarkus.workshop.CustomerSupportAssistant;
import dev.langchain4j.rag.AugmentationRequest;
import dev.langchain4j.rag.RetrievalAugmentor;
import dev.langchain4j.rag.content.Content;
import dev.langchain4j.rag.query.Metadata;
import io.quarkiverse.langchain4j.evaluation.junit5.Evaluate;
import io.quarkiverse.langchain4j.evaluation.junit5.SampleLocation;
import io.quarkiverse.langchain4j.evaluation.junit5.ScorerConfiguration;
import io.quarkiverse.langchain4j.testing.evaluation.EvaluationReport;
import io.quarkiverse.langchain4j.testing.evaluation.EvaluationResult;
import io.quarkiverse.langchain4j.testing.evaluation.EvaluationSample;
import io.quarkiverse.langchain4j.testing.evaluation.EvaluationStrategy;
import io.quarkiverse.langchain4j.testing.evaluation.Parameters;
import io.quarkiverse.langchain4j.testing.evaluation.Samples;
import io.quarkiverse.langchain4j.testing.evaluation.Scorer;
import io.quarkiverse.langchain4j.testing.evaluation.judge.AiJudgeStrategy;
import io.quarkiverse.langchain4j.testing.evaluation.similarity.SemanticSimilarityStrategy;
import io.quarkus.test.junit.QuarkusTest;
import jakarta.enterprise.context.control.ActivateRequestContext;
import jakarta.inject.Inject;
import jakarta.inject.Singleton;
import org.junit.jupiter.api.Test;
import java.util.List;
import java.util.UUID;
import java.util.function.Function;
import static org.assertj.core.api.Assertions.assertThat;
@QuarkusTest
@Evaluate
public class AssistantTest {
// Just a function calling the AI Service and returning the response as a String.
@Inject
AiServiceEvaluation aiServiceEvaluation;
// The content retriever from the RAG pattern I want to test
@Inject
RetrievalAugmentor retriever;
// Test the AI Service using the Semantic Similarity Strategy
@Test
void testAiService(@ScorerConfiguration(concurrency = 5) Scorer scorer,
@SampleLocation("src/test/resources/samples.yaml") Samples<String> samples) {
EvaluationReport report = scorer.evaluate(samples, aiServiceEvaluation,
new SemanticSimilarityStrategy(0.8));
assertThat(report.score()).isGreaterThanOrEqualTo(70);
}
// Test the AI Service using the AI Judge Strategy
@Test
void testAiServiceUsingAiJudge(Scorer scorer,
@SampleLocation("src/test/resources/samples.yaml") Samples<String> samples) {
var judge = OpenAiChatModel.builder()
.baseUrl("http://localhost:11434/v1") // Ollama
.modelName("mistral")
.build();
EvaluationReport report = scorer.evaluate(samples, aiServiceEvaluation,
new AiJudgeStrategy(judge));
assertThat(report.score()).isGreaterThanOrEqualTo(70);
}
// Evaluation strategies can be CDI beans (which means they can easily be injected)
@Inject
TextSegmentEvaluationStrategy textSegmentEvaluationStrategy;
// Test of the RAG retriever
@Test
void testRagRetriever(Scorer scorer, @SampleLocation("src/test/resources/content-retriever-samples.yaml") Samples<List<String>> samples) {
EvaluationReport report = scorer.evaluate(samples, i -> runRetriever(i.get(0)),
textSegmentEvaluationStrategy);
assertThat(report.score()).isEqualTo(100); // Expect full success
}
private List<String> runRetriever(String query) {
UserMessage message = UserMessage.userMessage(query);
AugmentationRequest request = new AugmentationRequest(message,
new Metadata(message, UUID.randomUUID().toString(), List.of()));
var res = retriever.augment(request);
return res.contents().stream().map(Content::textSegment).map(TextSegment::text).toList();
}
@Singleton
public static class AiServiceEvaluation implements Function<Parameters, String> {
@Inject
CustomerSupportAssistant assistant;
@ActivateRequestContext
@Override
public String apply(Parameters params) {
return assistant.chat(UUID.randomUUID().toString(), params.get(0)).collect()
.in(StringBuilder::new, StringBuilder::append).map(StringBuilder::toString).await().indefinitely();
}
}
@Singleton
public static class TextSegmentEvaluationStrategy implements EvaluationStrategy<List<String>> {
@Override
public EvaluationResult evaluate(EvaluationSample<List<String>> sample, List<String> response) {
List<String> expected = sample.expectedOutput();
int found = 0;
for (String seg : expected) {
// Make sure that the response contains the expected segment
boolean segFound = false;
for (String s : response) {
if (s.toLowerCase().contains(seg.toLowerCase())) {
segFound = true;
found++;
break;
}
}
if (!segFound) {
System.out.println("Segment not found: " + seg);
}
}
double score = (double) found / expected.size();
boolean passed = found == expected.size();
if (passed) {
return EvaluationResult.passed(score)
.withExplanation(String.format("Found %d of %d expected segments", found, expected.size()));
} else {
return EvaluationResult.failed(score,
String.format("Found %d of %d expected segments", found, expected.size()));
}
}
}
}
This test class demonstrates how to use the EvaluationExtension to evaluate an AI Service and a RAG retriever using different strategies.
The associated samples are:
---
- name: "car types"
parameters:
- "What types of cars do you offer for rental?"
expected-output: |
We offer three categories of cars:
1. Compact Commuter – Ideal for city driving, fuel-efficient, and budget-friendly. Example: Toyota Corolla, Honda Civic.
2. Family Explorer SUV – Perfect for family trips with spacious seating for up to 7 passengers. Example: Toyota RAV4, Hyundai Santa Fe.
3. Luxury Cruiser – Designed for traveling in style with premium features. Example: Mercedes-Benz E-Class, BMW 5 Series.
- name: "cancellation"
parameters:
- "Can I cancel my car rental booking at any time?"
expected-output: |
Our cancellation policy states that reservations can be canceled up to 11 days prior to the start of the booking period. If the booking period is less than 4 days, cancellations are not permitted.
- name: "teaching"
parameters:
- "Am I allowed to use the rental car to teach someone how to drive?"
expected-output: |
No, rental cars from Miles of Smiles cannot be used for teaching someone to drive, as outlined in our Terms of Use under “Use of Vehicle.”
- name: "damages"
parameters:
- "What happens if the car is damaged during my rental period?"
expected-output: |
You will be held liable for any damage, loss, or theft that occurs during the rental period, as stated in our Terms of Use under “Liability.”
- name: "requirements"
parameters:
- "What are the requirements for making a car rental booking?"
expected-output: |
To make a booking, you need to provide accurate, current, and complete information during the reservation process. All bookings are also subject to vehicle availability.
- name: "race"
parameters:
- "Can I use the rental car for a race or rally?"
expected-output: |
No, rental cars must not be used for any race, rally, or contest. This is prohibited as per our Terms of Use under “Use of Vehicle.”
- name: "family"
parameters:
- "Do you offer cars suitable for long family trips?"
expected-output: |
Yes, we recommend the Family Explorer SUV for long family trips. It offers spacious seating for up to seven passengers, ample cargo space, and advanced driver-assistance features.
- name: "alcohol"
parameters:
- "Is there any restriction on alcohol consumption while using the rental car?"
expected-output: |
Yes, you are not allowed to drive the rental car while under the influence of alcohol or drugs. This is strictly prohibited as stated in our Terms of Use.
- name: "other questions"
parameters:
- What should I do if I have questions unrelated to car rentals?
expected-output: |
For questions unrelated to car rentals, I recommend contacting the appropriate department. I’m here to assist with any car rental-related inquiries!
- name: "categories"
parameters:
- "Which car category is best for someone who values luxury and comfort?"
expected-output: |
If you value luxury and comfort, the Luxury Cruiser is the perfect choice. It offers premium interiors, cutting-edge technology, and unmatched comfort for a first-class driving experience.
and for the content retriever:
---
- name: cancellation_policy_test
parameters:
- What is the cancellation policy for car rentals?
expected-outputs:
- "Reservations can be cancelled up to 11 days prior to the start of the booking period."
- "If the booking period is less than 4 days, cancellations are not permitted."
- name: vehicle_restrictions_test
parameters:
- What are the restrictions on how the rental car can be used?
expected-outputs:
- "All cars rented from Miles of Smiles must not be used:"
- "for any illegal purpose or in connection with any criminal offense."
- "for teaching someone to drive."
- "in any race, rally or contest."
- "while under the influence of alcohol or drugs."
- name: car_types_test
parameters:
- What types of cars are available for rent?
expected-outputs:
- "Compact Commuter"
- "Perfect for city driving and short commutes, this fuel-efficient and easy-to-park car is your ideal companion for urban adventures"
- "Family Explorer SUV"
- "Designed for road trips, family vacations, or adventures with friends, this spacious and versatile SUV offers ample cargo space, comfortable seating for up to seven passengers"
- "Luxury Cruiser"
- "For those who want to travel in style, the Luxury Cruiser delivers unmatched comfort, cutting-edge technology, and a touch of elegance"
- name: car_damage_liability_test
parameters:
- What happens if I damage the car during my rental period?
expected-outputs:
- "Users will be held liable for any damage, loss, or theft that occurs during the rental period"
- name: governing_law_test
parameters:
- Under what law are the terms and conditions governed?
expected-outputs:
- "These terms will be governed by and construed in accordance with the laws of the United States of America"
- "Any disputes relating to these terms will be subject to the exclusive jurisdiction of the courts of United States"