go-eval
LLM evaluation for Go, inside standard go test.
go-eval v1.0 combines LLM-as-judge metrics, deterministic JSON and artifact checks, typed tool trajectories, structured agent traces, multi-step agent scenarios, grouped contracts, tiered CI slices, repeatability helpers, policy-aware summaries, baseline comparison, static reports, judge calibration, and profile-driven eval operations while keeping the core stdlib-only.
Go-native
Runs through testing.T, benchmarks, subtests, -parallel, and CI.
Agent-aware
Checks turns, tools, artifacts, traces, scenario state, and step contracts.
Ops-ready
Profiles, prerequisites, compare policies, reports, calibration, and JSONL output.
Install
go get github.com/igcodinap/go-eval
go install github.com/igcodinap/go-eval/cmd/goeval@latest
Quick Start
Write evaluation cases as standard Go tests. Case literals must use keyed fields; v0.4 and later require them.
package evaltest

import (
	"testing"

	eval "github.com/igcodinap/go-eval"
)

func TestRAGAnswer(t *testing.T) {
	judge := newMyJudge(t)
	r := eval.NewRunner(judge, eval.WithResultSink(eval.DefaultResultSink()))
	c := eval.Case{
		Input: "What's the capital of France?",
		Output: myRAG.Answer("What's the capital of France?"),
		Context: []string{"Paris is the capital of France."},
		Metadata: map[string]any{
			"flow": "rag.answer", "tier": "critical", "dataset": "capitals/v1",
		},
	}
	r.Run(t, eval.Faithfulness{Threshold: 0.8}, c)
	r.Run(t, eval.Hallucination{Threshold: 0.9}, c)
}
Run with: GOEVAL=1 go test ./...
CI-safe by default
Without GOEVAL=1, eval runs skip. Use GOEVAL_TRACE=1 only when you need prompt and response logs.
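The skip-by-default behavior can be approximated in plain Go. The following is a minimal sketch, assuming the gate is simply "GOEVAL equals 1"; the helper name evalEnabled is hypothetical and is not part of go-eval's API.

```go
package main

import (
	"fmt"
	"os"
)

// evalEnabled reports whether evals should run. It takes a lookup function
// so the check can be exercised without mutating the process environment.
// Assumption: only the exact value "1" enables evals.
func evalEnabled(getenv func(string) string) bool {
	return getenv("GOEVAL") == "1"
}

func main() {
	if !evalEnabled(os.Getenv) {
		fmt.Println("evals skipped: set GOEVAL=1 to run them")
		return
	}
	fmt.Println("evals enabled")
}
```

Inside a test, the same check would typically precede a call to t.Skip, so an ordinary go test run stays fast and never contacts a judge.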
Implementation Example
See how a Go API can wire go-eval into an agent workflow. The travel-planning example covers a goeval.json profile, shared runner options, redacted JSONL results, route artifact contracts, custom metrics, scenario state, tool policies, and compare gates.
Profile
Run critical integration evals with prerequisites and policy settings.
Contract
Validate route artifacts before judging final assistant prose.
Scenario
Exercise a multi-step trip-planning agent with required and forbidden tools.
Metrics
go-eval includes LLM-as-Judge, Deterministic, Trajectory, and Wrapper metrics.
| Metric | Type | Purpose | Threshold |
|---|---|---|---|
| Faithfulness | LLM-as-Judge | Verify RAG outputs do not contradict retrieved context | 0.8 |
| Hallucination | LLM-as-Judge | Catch outputs that invent facts outside the supplied context | 0.9 |
| AnswerRelevancy | LLM-as-Judge | Ensure the output directly addresses the user input | 0.7 |
| ContextPrecision | LLM-as-Judge | Check whether retrieved context documents are relevant to the input | 0.7 |
| ContextRecall | LLM-as-Judge | Check whether retrieved context contains the expected answer or facts | 0.7 |
| AnswerCorrectness | LLM-as-Judge | Verify the output matches the expected answer semantically | 0.7 |
| NoiseSensitivity | LLM-as-Judge | Ensure the output ignores irrelevant or distracting retrieved context | 0.7 |
| TaskCompletion | LLM-as-Judge | Verify the agent completed the user task end-to-end | 0.8 |
| PlanAdherence | LLM-as-Judge | Check whether the agent followed the expected plan or workflow | 0.7 |
| GEval | LLM-as-Judge | Score custom criteria that built-in metrics do not cover | 0.7 |
| Compound | LLM-as-Judge | Evaluate several related rubric dimensions in one judge call | per-dimension |
| Contains | Deterministic | Check that output contains a required substring | binary |
| Regex | Deterministic | Validate output against a regular expression | binary |
| JSONPath | Deterministic | Assert a value inside JSON output | binary |
| FieldCount | Deterministic | Enforce a minimum number of non-null JSON fields | config |
| ArtifactExists | Deterministic | Check that a named structured artifact exists on the case | binary |
| ArtifactNotExists | Deterministic | Assert that an unwanted structured artifact was not emitted | binary |
| ArtifactJSONPath | Deterministic | Assert a JSON value inside a named artifact | binary |
| ArtifactFieldCount | Deterministic | Require enough non-null fields inside an artifact object | config |
| ArtifactNumberLTE | Deterministic | Check that a numeric artifact value stays under a maximum | binary |
| ArtifactArrayContains | Deterministic | Check that an artifact array contains an expected value | binary |
| ArtifactArrayNotContains | Deterministic | Check that an artifact array excludes an unwanted value | binary |
| ArtifactArrayMinLen | Deterministic | Require an artifact array to have at least a minimum length | binary |
| ArtifactSubset | Deterministic | Assert that an artifact contains a partial expected JSON structure | binary |
| ToolArgumentAccuracy | Deterministic | Verify tool names and JSON arguments match expectations | 1.0 |
| StepEfficiency | Deterministic | Verify the trace stays within step and tool-call budgets | 1.0 |
| OutputLengthBudget | Deterministic | Keep final output within rune or word limits | config |
| ToolCallAccuracy | Trajectory | Compare actual tool calls with expected calls under a match mode | 1.0 |
| ToolCallF1 | Trajectory | Report precision, recall, and F1 for tool-call matches | 0.8 |
| RequiredTools | Trajectory | Fail when required tool names or name patterns are absent | binary |
| ForbiddenTool | Trajectory | Fail when disallowed tool names or patterns appear in the trajectory | binary |
| StepBudget | Trajectory | Keep tool-call count within a configured budget | binary |
| Precheck | Wrapper | Skip expensive LLM metrics when a cheap guard fails | wrapped metric |
| Contract | Wrapper | Group several deterministic or judge checks into one named result | all checks |
| Repeat | Wrapper | Run a metric multiple times and aggregate pass rate plus score stats | configurable pass rate |
| WithTokenBudget | Wrapper | Fail a wrapped metric when token usage exceeds a maximum | token max |
| WithLatencyBudget | Wrapper | Fail a wrapped metric when latency exceeds a maximum | duration max |
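To make the Repeat row concrete, here is a minimal sketch of aggregating repeated runs into a pass rate plus score statistics. It does not use go-eval; the names repeatStats and repeat are hypothetical, and the choice of population standard deviation is an assumption.

```go
package main

import (
	"fmt"
	"math"
)

// repeatStats summarizes repeated runs of one metric (hypothetical type).
type repeatStats struct {
	Runs, Passes int
	Mean, StdDev float64
}

// PassRate returns the fraction of runs whose score met the threshold.
func (s repeatStats) PassRate() float64 {
	if s.Runs == 0 {
		return 0
	}
	return float64(s.Passes) / float64(s.Runs)
}

// repeat calls score n times and aggregates the results. A run passes when
// its score is at least threshold. StdDev is the population standard deviation.
func repeat(n int, threshold float64, score func(run int) float64) repeatStats {
	if n <= 0 {
		return repeatStats{}
	}
	stats := repeatStats{Runs: n}
	scores := make([]float64, n)
	sum := 0.0
	for i := range scores {
		scores[i] = score(i)
		sum += scores[i]
		if scores[i] >= threshold {
			stats.Passes++
		}
	}
	stats.Mean = sum / float64(n)
	sq := 0.0
	for _, s := range scores {
		d := s - stats.Mean
		sq += d * d
	}
	stats.StdDev = math.Sqrt(sq / float64(n))
	return stats
}

func main() {
	// Stand-in for a nondeterministic judge: three fixed scores.
	fake := []float64{0.9, 0.7, 0.8}
	stats := repeat(len(fake), 0.8, func(run int) float64 { return fake[run] })
	fmt.Printf("passes=%d/%d rate=%.2f mean=%.2f stddev=%.3f\n",
		stats.Passes, stats.Runs, stats.PassRate(), stats.Mean, stats.StdDev)
}
```

Requiring a pass rate rather than a single pass lets a flaky judge fail occasionally without failing the whole check.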
Agent Scenarios
Use RunScenario for ordered multi-turn flows where each step can have its own input, tool policy, artifact contract, timeout, state, and repeat pass-rate requirement.
result := r.RunScenario(t, eval.Scenario{
	Name: "planning_to_route_ready",
	Tier: "critical",
	State: map[string]any{"locale": "es-CL"},
	Tools: eval.NewToolRegistry("plan_route", "select_map_items"),
	Repeat: eval.ScenarioRepeat{N: 3, PassRate: 2.0 / 3.0},
	Driver: func(ctx context.Context, req eval.StepRequest) (eval.StepResult, error) {
		return runAgentStep(ctx, req.Step.Input, req.History, req.Artifacts, req.State)
	},
	Steps: []eval.Step{
		{
			Name: "greeting", Input: "Hola",
			ForbiddenToolPatterns: []string{"plan_*", "select_*"},
			Timeout: 500 * time.Millisecond,
		},
		{
			Name: "ready_route_request", Input: "Propón la ruta",
			RequiredToolPatterns: []string{"plan_*"},
			Timeout: 3 * time.Second,
			Checks: []eval.Metric{
				eval.NewContract("ready_route",
					eval.ArtifactJSONPath{Key: "route", Path: "status", Expected: "ready"},
					eval.ArtifactArrayMinLen{Key: "route", Path: "stops", MinLen: 2},
				),
			},
		},
	},
})
if !result.Passed {
	t.Fatalf("scenario failed")
}
Scenario runs write normal metric rows plus a _scenario_summary JSONL row when a result sink is configured. Use LoadScenarios and BindScenarioDrivers to define scenarios in JSON while keeping drivers in Go.
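The required and forbidden tool patterns used in scenario steps can be pictured as glob checks over the tool names a step actually called. The sketch below assumes shell-style globs and uses the standard library's path.Match; go-eval's real pattern syntax may differ, and matchesAny and checkStep are hypothetical helpers.

```go
package main

import (
	"fmt"
	"path"
)

// matchesAny reports whether name matches at least one glob pattern.
// Malformed patterns are treated as non-matching.
func matchesAny(patterns []string, name string) bool {
	for _, p := range patterns {
		if ok, err := path.Match(p, name); err == nil && ok {
			return true
		}
	}
	return false
}

// checkStep returns one violation per forbidden tool that was called and
// one per required pattern that no called tool matched.
func checkStep(called, required, forbidden []string) []string {
	var violations []string
	for _, name := range called {
		if matchesAny(forbidden, name) {
			violations = append(violations, "forbidden tool called: "+name)
		}
	}
	for _, p := range required {
		found := false
		for _, name := range called {
			if matchesAny([]string{p}, name) {
				found = true
				break
			}
		}
		if !found {
			violations = append(violations, "no call matched required pattern: "+p)
		}
	}
	return violations
}

func main() {
	// A greeting step that wrongly triggered route planning.
	fmt.Println(checkStep([]string{"plan_route"}, nil, []string{"plan_*", "select_*"}))
	// A route request that called the planner as required.
	fmt.Println(checkStep([]string{"plan_route"}, []string{"plan_*"}, nil))
}
```

The first call reports a forbidden-tool violation, while the second returns no violations.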
Grouped Contracts
Contract turns several low-level checks into one named product requirement with per-check dimensions. It is especially useful inside scenario steps.
readyRoute := eval.Contract{
	ContractName: "ready_route",
	Checks: []eval.Metric{
		eval.ArtifactJSONPath{Key: "route", Path: "status", Expected: "ready"},
		eval.ArtifactSubset{
			Key: "route",
			Expected: json.RawMessage(`{"success":true}`),
		},
		eval.OutputLengthBudget{MaxWords: 180},
	},
}
r.Run(t, readyRoute, c)
Artifact Checks
Case.Artifacts stores named structured JSON outputs alongside the text output. Artifact metrics cover absence checks, array exclusion, JSON subset checks, wildcard paths, output length budgets, and normalizers.
c := eval.Case{
	Output: "Route is ready.",
	Artifacts: map[string]json.RawMessage{
		"route": json.RawMessage(`{
			"status":"ready",
			"total_minutes":98,
			"stops":[{"name":"Pajaritos"},{"name":"Valparaíso"}]
		}`),
	},
}
fold := eval.ChainNormalizers(
	eval.CaseFoldNormalizer(),
	eval.SpanishASCIIFoldNormalizer(),
)
r.Run(t, eval.ArtifactJSONPath{
	Key: "route", Path: "status", Expected: "ready",
}, c)
r.Run(t, eval.ArtifactArrayContains{
	Key: "route", Path: "stops[*].name", Expected: "pajaritos", Normalizer: fold,
}, c)
r.Run(t, eval.ArtifactArrayNotContains{
	Key: "route", Path: "stops[*].name", Expected: "Aeropuerto",
}, c)
r.Run(t, eval.ArtifactSubset{
	Key: "route", Expected: json.RawMessage(`{"status":"ready"}`),
}, c)
Trajectory Checks
Use Turn, ToolCall, Case.Turns, and Case.ExpectedToolCalls to evaluate agent tool-use paths without leaving the normal metric pipeline.
c := eval.Case{
	Turns: []eval.Turn{
		{Role: eval.RoleUser, Content: "Where is order 42?"},
		{Role: eval.RoleAssistant, ToolCalls: []eval.ToolCall{
			{Name: "orders.lookup", Arguments: json.RawMessage(`{"order_id":"42"}`)},
		}},
	},
	ExpectedToolCalls: []eval.ToolCall{
		{Name: "orders.lookup", Arguments: json.RawMessage(`{"order_id":"42"}`)},
	},
}
r.Run(t, eval.ToolCallAccuracy{Mode: eval.MatchStrict, MatchArgs: true}, c)
r.Run(t, eval.ToolCallF1{MatchArgs: true, Threshold: 0.8}, c)
r.Run(t, eval.RequiredTools{Patterns: []string{"orders.*"}}, c)
r.Run(t, eval.ForbiddenTool{Patterns: []string{"orders.refund*"}}, c)
r.Run(t, eval.StepBudget{MaxSteps: 1}, c)
Match modes are MatchStrict, MatchUnordered, MatchSubset, and MatchSuperset. JSON datasets can include optional turns and expected_tool_calls fields.
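For intuition about what a tool-call F1 score measures, the sketch below computes precision, recall, and F1 over tool names only, treating both lists as multisets. This is an illustration under stated assumptions, not go-eval's implementation: argument matching is ignored, an empty comparison scores 0, and toolCallF1 is a hypothetical helper.

```go
package main

import "fmt"

// toolCallF1 compares actual and expected tool names as multisets.
// Precision is matched/len(actual), recall is matched/len(expected),
// and F1 is their harmonic mean. Empty inputs yield zeros.
func toolCallF1(actual, expected []string) (precision, recall, f1 float64) {
	remaining := make(map[string]int)
	for _, name := range expected {
		remaining[name]++
	}
	matched := 0
	for _, name := range actual {
		if remaining[name] > 0 {
			remaining[name]--
			matched++
		}
	}
	if len(actual) > 0 {
		precision = float64(matched) / float64(len(actual))
	}
	if len(expected) > 0 {
		recall = float64(matched) / float64(len(expected))
	}
	if precision+recall > 0 {
		f1 = 2 * precision * recall / (precision + recall)
	}
	return precision, recall, f1
}

func main() {
	// The agent made the expected lookup plus one unexpected refund call.
	actual := []string{"orders.lookup", "orders.refund"}
	expected := []string{"orders.lookup"}
	p, r, f := toolCallF1(actual, expected)
	fmt.Printf("precision=%.2f recall=%.2f f1=%.2f\n", p, r, f)
}
```

Here the extra call halves precision while recall stays at 1, so F1 lands near 0.67 and would fall below a 0.8 threshold.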
Structured Traces
Use Case.Trace when your agent can emit structured spans, tool calls, artifact records, or state deltas. Trace IDs link metric rows, scenario summaries, and trace records in downstream reports.
r := eval.NewRunner(
	judge,
	eval.WithResultSink(eval.DefaultResultSink()),
	eval.WithTraceSink(eval.DefaultTraceSink()),
)
c := eval.Case{
	Input: "Find a route and charge the card",
	Output: answer,
	TraceID: "route-42",
	Trace: &eval.Trace{
		ID: "route-42",
		Name: "checkout_route",
		Spans: []eval.Span{{
			Name: "charge",
			Kind: "tool_call",
			ToolCall: &eval.ToolCall{
				Name: "payments.charge",
				Arguments: json.RawMessage(`{"amount":42}`),
			},
		}},
	},
}
When GOEVAL_RESULTS_DIR is set, DefaultTraceSink writes traces.jsonl alongside results.jsonl. Trace writes use the same WithRedactors hooks as result JSONL.
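The idea of redacting records before they reach a JSONL file can be sketched with the standard library alone. The signature of WithRedactors is not shown here, so the redactor type, maskKeys, and redactedLine below are hypothetical stand-ins that only illustrate the pattern.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// redactor edits a record in place before serialization (hypothetical type).
type redactor func(record map[string]any)

// maskKeys returns a redactor that replaces the values of the given
// top-level keys with a fixed placeholder.
func maskKeys(keys ...string) redactor {
	return func(record map[string]any) {
		for _, k := range keys {
			if _, ok := record[k]; ok {
				record[k] = "[REDACTED]"
			}
		}
	}
}

// redactedLine shallow-copies the record, applies every redactor to the
// copy, and returns one JSONL line: a JSON object followed by a newline.
func redactedLine(record map[string]any, redactors ...redactor) (string, error) {
	out := make(map[string]any, len(record))
	for k, v := range record {
		out[k] = v
	}
	for _, r := range redactors {
		r(out)
	}
	b, err := json.Marshal(out)
	if err != nil {
		return "", err
	}
	return string(b) + "\n", nil
}

func main() {
	rec := map[string]any{"trace_id": "route-42", "card_number": "4111111111111111"}
	line, err := redactedLine(rec, maskKeys("card_number"))
	if err != nil {
		panic(err)
	}
	fmt.Print(line)
}
```

Applying redactors to a copy keeps the in-memory record intact for later metrics while the shared log only ever sees the masked value.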
Benchmarks
Track latency, token usage, and score quality across prompt or model changes using standard Go benchmarks and benchstat.
func BenchmarkRAGLatency(b *testing.B) {
	r := eval.NewRunner(newMyJudge(b))
	c := eval.Case{Input: "...", Output: "...", Context: docs}
	eval.Bench(b, r, eval.Faithfulness{Threshold: 0.8}, c)
}
ns/op
Latency per judge call
tokens/op
Mean tokens consumed per call
score_mean
Average score across iterations
score_stddev
Score consistency across runs
Eval Operations
v1.0 adds an operations layer for repeatable eval runs: define goeval.json profiles, preflight prerequisites, run profile-aware tests, and apply the same policy to compare and summarize commands.
Profiles
Name PR, nightly, provider, or release-gate run shapes once.
Prerequisites
Require env vars, files, TCP endpoints, or custom checks before a run.
Compare policies
Set score tolerances, case IDs, and regression behavior in config.
Reliability
Summarize pass rates, p95 latency/tokens, scenario totals, and flaky identities.
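A compare policy with a score tolerance can be pictured as follows. This is a minimal sketch, assuming a regression means the current score dropped by more than the tolerance and that missing cases are judged separately; comparePolicy and compareScores are hypothetical names, and the case IDs are invented for illustration.

```go
package main

import (
	"fmt"
	"sort"
)

// comparePolicy is a hypothetical in-memory form of a compare policy.
type comparePolicy struct {
	ScoreTolerance   float64
	FailOnMissing    bool
	FailOnRegression bool
}

// compareScores checks every baseline case against the current run and
// returns the findings plus whether the policy says the gate should fail.
func compareScores(baseline, current map[string]float64, p comparePolicy) (findings []string, failed bool) {
	ids := make([]string, 0, len(baseline))
	for id := range baseline {
		ids = append(ids, id)
	}
	sort.Strings(ids) // deterministic report order
	for _, id := range ids {
		old := baseline[id]
		cur, ok := current[id]
		switch {
		case !ok:
			findings = append(findings, "missing: "+id)
			failed = failed || p.FailOnMissing
		case cur < old-p.ScoreTolerance:
			findings = append(findings, fmt.Sprintf("regression: %s %.2f -> %.2f", id, old, cur))
			failed = failed || p.FailOnRegression
		}
	}
	return findings, failed
}

func main() {
	p := comparePolicy{ScoreTolerance: 0.02, FailOnMissing: true, FailOnRegression: true}
	baseline := map[string]float64{"capital-fr": 0.90, "capital-cl": 0.80, "capital-pe": 0.85}
	current := map[string]float64{"capital-fr": 0.85, "capital-cl": 0.79}
	findings, failed := compareScores(baseline, current, p)
	for _, f := range findings {
		fmt.Println(f)
	}
	fmt.Println("gate failed:", failed)
}
```

In this run the 0.01 drop on capital-cl stays inside the tolerance, while the 0.05 drop on capital-fr and the absent capital-pe case are both reported.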
{
  "profiles": {
    "pr": {
      "packages": ["./..."],
      "tiers": ["critical"],
      "results_dir": ".goeval/pr"
    },
    "google": {
      "packages": ["./..."],
      "tiers": ["critical", "standard"],
      "results_dir": ".goeval/google",
      "prerequisites": [
        {"type": "env", "name": "GEMINI_API_KEY"},
        {"type": "env", "name": "GOOGLE_ROUTES_API_KEY"}
      ],
      "missing_prerequisite": "skip"
    }
  },
  "compare": {
    "case_id_key": "case_id",
    "default": {
      "score_tolerance": 0.02,
      "fail_on_missing": true,
      "fail_on_regression": true
    }
  }
}
goeval test --profile pr
goeval test --profile google --config goeval.json -run Route
goeval compare --policy goeval.json --format json old/results.jsonl new/results.jsonl
goeval compare --fail-on-regression=false old/results.jsonl new/results.jsonl
goeval summarize --policy goeval.json new/results.jsonl
Reports And Calibration
Render static HTML, Markdown, or JSON reports from JSONL result files. Use calibration to analyze judge disagreement and compare A/B variants.
Static Reports
Render HTML, Markdown, or JSON reports from one or two result files.
Judge Calibration
Analyze judge disagreement, aggregate duplicate rows, and compare A/B variants.
goeval report current/results.jsonl --out report.html
goeval report --baseline old/results.jsonl --current new/results.jsonl --format markdown
goeval calibrate --case-id-key case_id --judge-key judge current/results.jsonl
goeval calibrate --pairwise-key variant results.jsonl
CI/CD
Persist JSONL results, run named profiles, compare baselines with policy tolerances, summarize reliability, redact sensitive metadata, and filter case tiers while keeping normal CI fast by default.
Install DefaultTierFilter on the runner to use GOEVAL_TIER, declare run prerequisites in goeval.json or with eval.Require, and add WithRedactors before writing shared result logs.
goeval test --profile pr
goeval compare --policy goeval.json old/results.jsonl new/results.jsonl
goeval summarize --policy goeval.json .goeval/pr/results.jsonl
Environment Variables
GOEVAL=1 - Enable evaluations
GOEVAL_TRACE=1 - Log judge prompts and responses via t.Log
GOEVAL_TIER - Filter tiers when DefaultTierFilter is installed
GOEVAL_RESULTS_DIR - Write results.jsonl in this directory
Judge Adapters
Optional judge adapters live in separate modules so the core package stays stdlib-only. Use the Ollama adapter for local LLM-as-judge scoring, or the OpenAI adapter for cloud-based evaluation.
go get github.com/igcodinap/go-eval/adapters/ollama
go get github.com/igcodinap/go-eval/adapters/openai github.com/sashabaranov/go-openai
import ollamaeval "github.com/igcodinap/go-eval/adapters/ollama"

judge := ollamaeval.NewJudge("llama3.2")
r := eval.NewRunner(judge)
r.Run(t, eval.Faithfulness{Threshold: 0.8}, eval.Case{
	Input: "What is the capital of France?",
	Output: "Paris is the capital of France.",
	Context: []string{"Paris is the capital of France."},
})
You can also implement your own Judge by wrapping any LLM provider:
type MyJudge struct{}

func (j *MyJudge) Evaluate(ctx context.Context, prompt string) (eval.JudgeResponse, error) {
	// 1. Send prompt to an LLM.
	// 2. Parse its JSON {"score": float, "reason": string} response.
	// 3. Return eval.JudgeResponse{Score, Reason, Tokens}.
	// Must be safe for concurrent use.
	return eval.JudgeResponse{}, nil
}
CLI
The optional goeval CLI wraps common test, profile, compare, and summary workflows.
goeval test
Run a named goeval.json profile with GOEVAL=1, tier filters, result directories, and prerequisites applied.
goeval test --profile pr
goeval compare
Compare baseline and current JSONL results with policy tolerances, case IDs, and regression rules.
goeval compare --policy goeval.json old/results.jsonl new/results.jsonl
goeval summarize
Summarize pass rates, p95 latency/tokens, metadata groups, scenario totals, and flaky identities.
goeval summarize --policy goeval.json current/results.jsonl
goeval report
Render static HTML, Markdown, or JSON reports from JSONL result files.
goeval report current/results.jsonl --out report.html
goeval calibrate
Analyze judge disagreement, aggregate duplicate rows, and compare A/B variants.
goeval calibrate --judge-key judge current/results.jsonl
goeval version
Print CLI version information.
goeval version