go-eval

LLM evaluation for Go, inside standard go test.

go-eval v1.0 combines LLM-as-judge metrics, deterministic JSON and artifact checks, typed tool trajectories, structured agent traces, multi-step agent scenarios, grouped contracts, tiered CI slices, repeatability helpers, policy-aware summaries, baseline comparison, static reports, judge calibration, and profile-driven eval operations while keeping the core stdlib-only.

Go-native

Runs through testing.T, benchmarks, subtests, -parallel, and CI.

Agent-aware

Checks turns, tools, artifacts, traces, scenario state, and step contracts.

Ops-ready

Profiles, prerequisites, compare policies, reports, calibration, and JSONL output.

Install

go get github.com/igcodinap/go-eval
go install github.com/igcodinap/go-eval/cmd/goeval@latest

Quick Start

Write evaluation cases as standard Go tests. Use keyed fields in Case literals; v0.4 and later require them.

package evaltest

import (
	"testing"

	eval "github.com/igcodinap/go-eval"
)

func TestRAGAnswer(t *testing.T) {
	judge := newMyJudge(t)
	r := eval.NewRunner(judge, eval.WithResultSink(eval.DefaultResultSink()))

	c := eval.Case{
		Input:   "What's the capital of France?",
		Output:  myRAG.Answer("What's the capital of France?"),
		Context: []string{"Paris is the capital of France."},
		Metadata: map[string]any{
			"flow": "rag.answer", "tier": "critical", "dataset": "capitals/v1",
		},
	}

	r.Run(t, eval.Faithfulness{Threshold: 0.8}, c)
	r.Run(t, eval.Hallucination{Threshold: 0.9}, c)
}

Run with: GOEVAL=1 go test ./...

CI-safe by default

Without GOEVAL=1, eval runs are skipped. Set GOEVAL_TRACE=1 only when you need prompt and response logs.
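
The gating described above can be approximated with the standard library alone. This is a minimal sketch, not go-eval's implementation: the helper name evalEnabled is hypothetical, and only the GOEVAL=1 convention comes from the text.

```go
package main

import (
	"fmt"
	"os"
)

// evalEnabled reports whether evals should run. Hypothetical helper: it
// treats only the exact value "1" as enabled, mirroring GOEVAL=1.
// The lookup function is injected so the check is easy to test.
func evalEnabled(getenv func(string) string) bool {
	return getenv("GOEVAL") == "1"
}

func main() {
	if !evalEnabled(os.Getenv) {
		fmt.Println("evals skipped: set GOEVAL=1 to run them")
		return
	}
	fmt.Println("evals enabled")
}
```

Inside a test, the same check would typically guard a call to t.Skip so that a plain go test ./... never reaches a judge.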

Implementation Example

See how a Go API can wire go-eval into an agent workflow. The travel-planning example covers a goeval.json profile, shared runner options, redacted JSONL results, route artifact contracts, custom metrics, scenario state, tool policies, and compare gates.

Profile

Run critical integration evals with prerequisites and policy settings.

Contract

Validate route artifacts before judging final assistant prose.

Scenario

Exercise a multi-step trip-planning agent with required and forbidden tools.


Metrics

go-eval includes LLM-as-Judge, Deterministic, Trajectory, and Wrapper metrics.

| Metric | Type | Purpose | Threshold |
| --- | --- | --- | --- |
| Faithfulness | LLM-as-Judge | Verify RAG outputs do not contradict retrieved context | 0.8 |
| Hallucination | LLM-as-Judge | Catch outputs that invent facts outside the supplied context | 0.9 |
| AnswerRelevancy | LLM-as-Judge | Ensure the output directly addresses the user input | 0.7 |
| ContextPrecision | LLM-as-Judge | Check whether retrieved context documents are relevant to the input | 0.7 |
| ContextRecall | LLM-as-Judge | Check whether retrieved context contains the expected answer or facts | 0.7 |
| AnswerCorrectness | LLM-as-Judge | Verify the output matches the expected answer semantically | 0.7 |
| NoiseSensitivity | LLM-as-Judge | Ensure the output ignores irrelevant or distracting retrieved context | 0.7 |
| TaskCompletion | LLM-as-Judge | Verify the agent completed the user task end-to-end | 0.8 |
| PlanAdherence | LLM-as-Judge | Check whether the agent followed the expected plan or workflow | 0.7 |
| GEval | LLM-as-Judge | Score custom criteria that built-in metrics do not cover | 0.7 |
| Compound | LLM-as-Judge | Evaluate several related rubric dimensions in one judge call | per-dimension |
| Contains | Deterministic | Check that output contains a required substring | binary |
| Regex | Deterministic | Validate output against a regular expression | binary |
| JSONPath | Deterministic | Assert a value inside JSON output | binary |
| FieldCount | Deterministic | Enforce a minimum number of non-null JSON fields | config |
| ArtifactExists | Deterministic | Check that a named structured artifact exists on the case | binary |
| ArtifactNotExists | Deterministic | Assert that an unwanted structured artifact was not emitted | binary |
| ArtifactJSONPath | Deterministic | Assert a JSON value inside a named artifact | binary |
| ArtifactFieldCount | Deterministic | Require enough non-null fields inside an artifact object | config |
| ArtifactNumberLTE | Deterministic | Check that a numeric artifact value stays under a maximum | binary |
| ArtifactArrayContains | Deterministic | Check that an artifact array contains an expected value | binary |
| ArtifactArrayNotContains | Deterministic | Check that an artifact array excludes an unwanted value | binary |
| ArtifactArrayMinLen | Deterministic | Require an artifact array to have at least a minimum length | binary |
| ArtifactSubset | Deterministic | Assert that an artifact contains a partial expected JSON structure | binary |
| ToolArgumentAccuracy | Deterministic | Verify tool names and JSON arguments match expectations | 1.0 |
| StepEfficiency | Deterministic | Verify the trace stays within step and tool-call budgets | 1.0 |
| OutputLengthBudget | Deterministic | Keep final output within rune or word limits | config |
| ToolCallAccuracy | Trajectory | Compare actual tool calls with expected calls under a match mode | 1.0 |
| ToolCallF1 | Trajectory | Report precision, recall, and F1 for tool-call matches | 0.8 |
| RequiredTools | Trajectory | Fail when required tool names or name patterns are absent | binary |
| ForbiddenTool | Trajectory | Fail when disallowed tool names or patterns appear in the trajectory | binary |
| StepBudget | Trajectory | Keep tool-call count within a configured budget | binary |
| Precheck | Wrapper | Skip expensive LLM metrics when a cheap guard fails | wrapped metric |
| Contract | Wrapper | Group several deterministic or judge checks into one named result | all checks |
| Repeat | Wrapper | Run a metric multiple times and aggregate pass rate plus score stats | configurable pass rate |
| WithTokenBudget | Wrapper | Fail a wrapped metric when token usage exceeds a maximum | token max |
| WithLatencyBudget | Wrapper | Fail a wrapped metric when latency exceeds a maximum | duration max |

Agent Scenarios

Use RunScenario for ordered multi-turn flows where each step can have its own input, tool policy, artifact contract, timeout, state, and repeat pass-rate requirement.

result := r.RunScenario(t, eval.Scenario{
	Name:  "planning_to_route_ready",
	Tier:  "critical",
	State: map[string]any{"locale": "es-CL"},
	Tools: eval.NewToolRegistry("plan_route", "select_map_items"),
	Repeat: eval.ScenarioRepeat{N: 3, PassRate: 2.0 / 3.0},
	Driver: func(ctx context.Context, req eval.StepRequest) (eval.StepResult, error) {
		return runAgentStep(ctx, req.Step.Input, req.History, req.Artifacts, req.State)
	},
	Steps: []eval.Step{
		{
			Name: "greeting", Input: "Hola",
			ForbiddenToolPatterns: []string{"plan_*", "select_*"},
			Timeout: 500 * time.Millisecond,
		},
		{
			Name: "ready_route_request", Input: "Propón la ruta",
			RequiredToolPatterns: []string{"plan_*"},
			Timeout: 3 * time.Second,
			Checks: []eval.Metric{
				eval.NewContract("ready_route",
					eval.ArtifactJSONPath{Key: "route", Path: "status", Expected: "ready"},
					eval.ArtifactArrayMinLen{Key: "route", Path: "stops", MinLen: 2},
				),
			},
		},
	},
})

if !result.Passed {
	t.Fatalf("scenario failed")
}

Scenario runs write normal metric rows plus a _scenario_summary JSONL row when a result sink is configured. Use LoadScenarios and BindScenarioDrivers to define scenarios in JSON while keeping drivers in Go.
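
The repeat pass-rate requirement used in the scenario above (N: 3, PassRate: 2.0 / 3.0) comes down to a simple ratio check. The sketch below is one plausible reading of that rule, not go-eval's code; meetsPassRate is a hypothetical helper name.

```go
package main

import "fmt"

// meetsPassRate reports whether passed successes out of n runs reach the
// required pass rate. Hypothetical helper: it assumes the rule is
// passed/n >= required and treats n <= 0 as a failure.
func meetsPassRate(passed, n int, required float64) bool {
	if n <= 0 {
		return false
	}
	return float64(passed)/float64(n) >= required
}

func main() {
	required := 2.0 / 3.0
	for passed := 0; passed <= 3; passed++ {
		fmt.Printf("%d/3 passed -> ok=%v\n", passed, meetsPassRate(passed, 3, required))
	}
}
```

Under this reading, a scenario repeated three times tolerates one flaky run but fails on two.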

Grouped Contracts

Contract turns several low-level checks into one named product requirement with per-check dimensions. It is especially useful inside scenario steps.

readyRoute := eval.Contract{
	ContractName: "ready_route",
	Checks: []eval.Metric{
		eval.ArtifactJSONPath{Key: "route", Path: "status", Expected: "ready"},
		eval.ArtifactSubset{
			Key:      "route",
			Expected: json.RawMessage(`{"success":true}`),
		},
		eval.OutputLengthBudget{MaxWords: 180},
	},
}

r.Run(t, readyRoute, c)
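
To make the "all checks" rule concrete, here is a stdlib-only sketch of how a grouped result might be aggregated. The types and function are hypothetical stand-ins for illustration and do not reflect go-eval's internal types.

```go
package main

import "fmt"

// check is a hypothetical stand-in for the outcome of one metric in a contract.
type check struct {
	Name   string
	Passed bool
}

// evaluateContract returns true only when every check passed, together with
// the names of the checks that failed. An empty contract passes vacuously.
func evaluateContract(checks []check) (bool, []string) {
	var failed []string
	for _, c := range checks {
		if !c.Passed {
			failed = append(failed, c.Name)
		}
	}
	return len(failed) == 0, failed
}

func main() {
	ok, failed := evaluateContract([]check{
		{Name: "route.status", Passed: true},
		{Name: "route.subset", Passed: true},
		{Name: "output.length", Passed: false},
	})
	fmt.Println("ready_route passed:", ok, "failed checks:", failed)
}
```

Keeping the failed check names alongside the single pass/fail verdict is what lets one contract row still explain which requirement broke.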

Artifact Checks

Case.Artifacts stores named structured JSON outputs alongside text output, with absence checks, array exclusion, JSON subset checks, wildcard paths, output length budgets, and normalizers.

c := eval.Case{
	Output: "Route is ready.",
	Artifacts: map[string]json.RawMessage{
		"route": json.RawMessage(`{
			"status":"ready",
			"total_minutes":98,
			"stops":[{"name":"Pajaritos"},{"name":"Valparaíso"}]
		}`),
	},
}

fold := eval.ChainNormalizers(
	eval.CaseFoldNormalizer(),
	eval.SpanishASCIIFoldNormalizer(),
)

r.Run(t, eval.ArtifactJSONPath{
	Key: "route", Path: "status", Expected: "ready",
}, c)
r.Run(t, eval.ArtifactArrayContains{
	Key: "route", Path: "stops[*].name", Expected: "pajaritos", Normalizer: fold,
}, c)
r.Run(t, eval.ArtifactArrayNotContains{
	Key: "route", Path: "stops[*].name", Expected: "Aeropuerto",
}, c)
r.Run(t, eval.ArtifactSubset{
	Key: "route", Expected: json.RawMessage(`{"status":"ready"}`),
}, c)

Trajectory Checks

Use Turn, ToolCall, Case.Turns, and Case.ExpectedToolCalls to evaluate agent tool-use paths without leaving the normal metric pipeline.

c := eval.Case{
	Turns: []eval.Turn{
		{Role: eval.RoleUser, Content: "Where is order 42?"},
		{Role: eval.RoleAssistant, ToolCalls: []eval.ToolCall{
			{Name: "orders.lookup", Arguments: json.RawMessage(`{"order_id":"42"}`)},
		}},
	},
	ExpectedToolCalls: []eval.ToolCall{
		{Name: "orders.lookup", Arguments: json.RawMessage(`{"order_id":"42"}`)},
	},
}

r.Run(t, eval.ToolCallAccuracy{Mode: eval.MatchStrict, MatchArgs: true}, c)
r.Run(t, eval.ToolCallF1{MatchArgs: true, Threshold: 0.8}, c)
r.Run(t, eval.RequiredTools{Patterns: []string{"orders.*"}}, c)
r.Run(t, eval.ForbiddenTool{Patterns: []string{"orders.refund*"}}, c)
r.Run(t, eval.StepBudget{MaxSteps: 1}, c)

Match modes are MatchStrict, MatchUnordered, MatchSubset, and MatchSuperset. JSON datasets can include optional turns and expected_tool_calls fields.
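
ToolCallF1 reports precision, recall, and F1 for tool-call matches. As a worked illustration, the sketch below computes those three numbers over tool names only, treating calls as multisets. It is an assumption about the general technique, not go-eval's matching logic, and it ignores arguments and match modes.

```go
package main

import "fmt"

// toolCallF1 computes precision, recall, and F1 over tool names.
// Each expected call can be matched at most once (multiset matching).
// Hypothetical helper for illustration only.
func toolCallF1(actual, expected []string) (precision, recall, f1 float64) {
	remaining := make(map[string]int)
	for _, name := range expected {
		remaining[name]++
	}
	matched := 0
	for _, name := range actual {
		if remaining[name] > 0 {
			remaining[name]--
			matched++
		}
	}
	if len(actual) > 0 {
		precision = float64(matched) / float64(len(actual))
	}
	if len(expected) > 0 {
		recall = float64(matched) / float64(len(expected))
	}
	if precision+recall > 0 {
		f1 = 2 * precision * recall / (precision + recall)
	}
	return precision, recall, f1
}

func main() {
	// One expected call was made, plus one extra call that was not expected.
	p, r, f := toolCallF1(
		[]string{"orders.lookup", "orders.refund"},
		[]string{"orders.lookup"},
	)
	fmt.Printf("precision=%.2f recall=%.2f f1=%.2f\n", p, r, f)
}
```

In this example precision is 1/2, recall is 1/1, and F1 is 2/3, which would fall below the 0.8 threshold used above.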

Structured Traces

Use Case.Trace when your agent can emit structured spans, tool calls, artifact records, or state deltas. Trace IDs link metric rows, scenario summaries, and trace records in downstream reports.

r := eval.NewRunner(
	judge,
	eval.WithResultSink(eval.DefaultResultSink()),
	eval.WithTraceSink(eval.DefaultTraceSink()),
)

c := eval.Case{
	Input:   "Find a route and charge the card",
	Output:  answer,
	TraceID: "route-42",
	Trace: &eval.Trace{
		ID:   "route-42",
		Name: "checkout_route",
		Spans: []eval.Span{{
			Name: "charge",
			Kind: "tool_call",
			ToolCall: &eval.ToolCall{
				Name:      "payments.charge",
				Arguments: json.RawMessage(`{"amount":42}`),
			},
		}},
	},
}

When GOEVAL_RESULTS_DIR is set, DefaultTraceSink writes traces.jsonl alongside results.jsonl. Trace writes use the same WithRedactors hooks as result JSONL.
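
Redaction before a JSONL write can be pictured as a hook that rewrites a record just before it is serialized to one line. The sketch below illustrates that pattern with encoding/json. The redactor type, redactKeys, and redactedLine are hypothetical names, and the record fields are made up for the example; none of this mirrors the WithRedactors signature.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// redactor is a hypothetical hook that edits a record in place.
type redactor func(record map[string]any)

// redactKeys returns a redactor that masks the listed top-level keys.
func redactKeys(keys ...string) redactor {
	return func(record map[string]any) {
		for _, k := range keys {
			if _, ok := record[k]; ok {
				record[k] = "[REDACTED]"
			}
		}
	}
}

// redactedLine copies the record (shallow copy), applies the redactors to
// the copy, and returns it as a single newline-terminated JSON line.
func redactedLine(record map[string]any, redactors ...redactor) (string, error) {
	clean := make(map[string]any, len(record))
	for k, v := range record {
		clean[k] = v
	}
	for _, r := range redactors {
		r(clean)
	}
	b, err := json.Marshal(clean)
	if err != nil {
		return "", err
	}
	return string(b) + "\n", nil
}

func main() {
	line, err := redactedLine(
		map[string]any{"trace_id": "route-42", "card_number": "4111"},
		redactKeys("card_number"),
	)
	if err != nil {
		panic(err)
	}
	fmt.Print(line)
}
```

Applying the same hooks to both result rows and trace records keeps a secret from leaking through one file after it was masked in the other.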

Benchmarks

Track latency, token usage, and score quality across prompt or model changes using standard Go benchmarks and benchstat.

func BenchmarkRAGLatency(b *testing.B) {
	r := eval.NewRunner(newMyJudge(b))
	c := eval.Case{Input: "...", Output: "...", Context: docs}

	eval.Bench(b, r, eval.Faithfulness{Threshold: 0.8}, c)
}

  • ns/op - Latency per judge call
  • tokens/op - Mean tokens consumed per call
  • score_mean - Average score across iterations
  • score_stddev - Score consistency across runs
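
For readers who want to see what score_mean and score_stddev summarize, here is a small worked computation. It uses the population standard deviation, which is an assumption; the source does not say whether go-eval uses the population or the sample form.

```go
package main

import (
	"fmt"
	"math"
)

// meanStddev returns the arithmetic mean and the population standard
// deviation of scores. It returns 0, 0 for an empty slice.
func meanStddev(scores []float64) (mean, stddev float64) {
	if len(scores) == 0 {
		return 0, 0
	}
	for _, s := range scores {
		mean += s
	}
	mean /= float64(len(scores))

	var sumSq float64
	for _, s := range scores {
		d := s - mean
		sumSq += d * d
	}
	return mean, math.Sqrt(sumSq / float64(len(scores)))
}

func main() {
	mean, stddev := meanStddev([]float64{0.5, 1.0})
	fmt.Printf("score_mean=%.2f score_stddev=%.2f\n", mean, stddev)
}
```

A low standard deviation across iterations suggests the judge scores the same case consistently, while a high one is a hint to use Repeat or calibration.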

Eval Operations

v1.0 adds an operations layer for repeatable eval runs: define goeval.json profiles, preflight prerequisites, run profile-aware tests, and apply the same policy to compare and summarize commands.

Profiles

Name PR, nightly, provider, or release-gate run shapes once.

Prerequisites

Require env vars, files, TCP endpoints, or custom checks before a run.

Compare policies

Set score tolerances, case IDs, and regression behavior in config.

Reliability

Summarize pass rates, p95 latency/tokens, scenario totals, and flaky identities.

{
  "profiles": {
    "pr": {
      "packages": ["./..."],
      "tiers": ["critical"],
      "results_dir": ".goeval/pr"
    },
    "google": {
      "packages": ["./..."],
      "tiers": ["critical", "standard"],
      "results_dir": ".goeval/google",
      "prerequisites": [
        {"type": "env", "name": "GEMINI_API_KEY"},
        {"type": "env", "name": "GOOGLE_ROUTES_API_KEY"}
      ],
      "missing_prerequisite": "skip"
    }
  },
  "compare": {
    "case_id_key": "case_id",
    "default": {
      "score_tolerance": 0.02,
      "fail_on_missing": true,
      "fail_on_regression": true
    }
  }
}

goeval test --profile pr
goeval test --profile google --config goeval.json -run Route
goeval compare --policy goeval.json --format json old/results.jsonl new/results.jsonl
goeval compare --fail-on-regression=false old/results.jsonl new/results.jsonl
goeval summarize --policy goeval.json new/results.jsonl
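
The compare policy fields score_tolerance, fail_on_missing, and case_id_key suggest a comparison keyed by stable case ID. The sketch below shows one way such a comparison could work. It assumes a regression means the current score dropped by more than the tolerance, which the source does not spell out, and compareScores is a hypothetical helper.

```go
package main

import (
	"fmt"
	"sort"
)

// compareScores compares per-case scores keyed by a stable case ID.
// A case is reported as regressed when its current score is more than
// tolerance below the baseline, and as missing when it is absent from
// the current results. Both lists are sorted for stable output.
func compareScores(baseline, current map[string]float64, tolerance float64) (regressed, missing []string) {
	for id, old := range baseline {
		cur, ok := current[id]
		switch {
		case !ok:
			missing = append(missing, id)
		case cur < old-tolerance:
			regressed = append(regressed, id)
		}
	}
	sort.Strings(regressed)
	sort.Strings(missing)
	return regressed, missing
}

func main() {
	baseline := map[string]float64{"capital-fr": 0.90, "capital-cl": 0.80, "capital-pe": 0.70}
	current := map[string]float64{"capital-fr": 0.89, "capital-cl": 0.70}

	regressed, missing := compareScores(baseline, current, 0.02)
	fmt.Println("regressed:", regressed, "missing:", missing)
}
```

Here capital-fr moves by 0.01 and stays within the 0.02 tolerance, capital-cl drops by 0.10 and is flagged, and capital-pe is reported as missing.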

Reports And Calibration

Render static HTML, Markdown, or JSON reports from JSONL result files. Use calibration to analyze judge disagreement and compare A/B variants.

Static Reports

Render HTML, Markdown, or JSON reports from one or two result files.

Judge Calibration

Analyze judge disagreement, aggregate duplicate rows, and compare A/B variants.

goeval report current/results.jsonl --out report.html
goeval report --baseline old/results.jsonl --current new/results.jsonl --format markdown
goeval calibrate --case-id-key case_id --judge-key judge current/results.jsonl
goeval calibrate --pairwise-key variant results.jsonl

CI/CD

Persist JSONL results, run named profiles, compare baselines with policy tolerances, summarize reliability, redact sensitive metadata, and filter case tiers while keeping normal CI fast by default.

Install DefaultTierFilter on the runner to use GOEVAL_TIER, declare run prerequisites in goeval.json or with eval.Require, and add WithRedactors before writing shared result logs.

goeval test --profile pr
goeval compare --policy goeval.json old/results.jsonl new/results.jsonl
goeval summarize --policy goeval.json .goeval/pr/results.jsonl
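
Prerequisite checks for env vars, files, and TCP endpoints map naturally onto standard-library calls. The sketch below is an illustration under assumptions: the prerequisite struct and checkPrerequisite are hypothetical, and only the env form with "type" and "name" appears in the goeval.json example; the "file" and "tcp" type strings are guesses.

```go
package main

import (
	"fmt"
	"net"
	"os"
	"time"
)

// prerequisite is a hypothetical shape: Name holds an env var name,
// a file path, or a host:port address depending on Type.
type prerequisite struct {
	Type string
	Name string
}

// checkPrerequisite returns nil when the prerequisite is satisfied.
func checkPrerequisite(p prerequisite) error {
	switch p.Type {
	case "env":
		if v, ok := os.LookupEnv(p.Name); !ok || v == "" {
			return fmt.Errorf("env %s is not set", p.Name)
		}
	case "file":
		if _, err := os.Stat(p.Name); err != nil {
			return fmt.Errorf("file %s: %w", p.Name, err)
		}
	case "tcp":
		conn, err := net.DialTimeout("tcp", p.Name, 2*time.Second)
		if err != nil {
			return fmt.Errorf("tcp %s: %w", p.Name, err)
		}
		conn.Close()
	default:
		return fmt.Errorf("unknown prerequisite type %q", p.Type)
	}
	return nil
}

func main() {
	for _, p := range []prerequisite{
		{Type: "env", Name: "GEMINI_API_KEY"},
		{Type: "file", Name: "goeval.json"},
	} {
		if err := checkPrerequisite(p); err != nil {
			fmt.Println("prerequisite not met:", err)
			continue
		}
		fmt.Println("prerequisite ok:", p.Type, p.Name)
	}
}
```

Whether an unmet prerequisite skips or fails the run is a policy decision, which is what the missing_prerequisite setting in the profile controls.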

Environment Variables

  • GOEVAL=1 - Enable evaluations
  • GOEVAL_TRACE=1 - Log judge prompts and responses via t.Log
  • GOEVAL_TIER - Filter tiers when DefaultTierFilter is installed
  • GOEVAL_RESULTS_DIR - Write results.jsonl in this directory

Judge Adapters

Optional judge adapters live in separate modules so the core package stays stdlib-only. Use the Ollama adapter for local LLM-as-judge scoring, or the OpenAI adapter for cloud-based evaluation.

go get github.com/igcodinap/go-eval/adapters/ollama
go get github.com/igcodinap/go-eval/adapters/openai github.com/sashabaranov/go-openai

import ollamaeval "github.com/igcodinap/go-eval/adapters/ollama"

judge := ollamaeval.NewJudge("llama3.2")
r := eval.NewRunner(judge)

r.Run(t, eval.Faithfulness{Threshold: 0.8}, eval.Case{
	Input:   "What is the capital of France?",
	Output:  "Paris is the capital of France.",
	Context: []string{"Paris is the capital of France."},
})

You can also implement your own Judge by wrapping any LLM provider:

type MyJudge struct{}

func (j *MyJudge) Evaluate(ctx context.Context, prompt string) (eval.JudgeResponse, error) {
	// 1. Send prompt to an LLM.
	// 2. Parse its JSON {"score": float, "reason": string} response.
	// 3. Return eval.JudgeResponse{Score, Reason, Tokens}.
	// Must be safe for concurrent use.
	return eval.JudgeResponse{}, nil
}
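
Step 2 in the skeleton above, parsing the judge's JSON reply, can be sketched with encoding/json. The judgeVerdict type and parseVerdict function are hypothetical names, and the [0, 1] score range is an assumption inferred from the thresholds in the metrics table rather than something the source states.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// judgeVerdict mirrors the {"score": float, "reason": string} reply shape.
type judgeVerdict struct {
	Score  float64 `json:"score"`
	Reason string  `json:"reason"`
}

// parseVerdict decodes a raw LLM reply and rejects malformed JSON or
// scores outside the assumed [0, 1] range.
func parseVerdict(raw string) (judgeVerdict, error) {
	var v judgeVerdict
	if err := json.Unmarshal([]byte(raw), &v); err != nil {
		return judgeVerdict{}, fmt.Errorf("parse judge response: %w", err)
	}
	if v.Score < 0 || v.Score > 1 {
		return judgeVerdict{}, fmt.Errorf("score %v outside [0, 1]", v.Score)
	}
	return v, nil
}

func main() {
	v, err := parseVerdict(`{"score": 0.9, "reason": "answer is grounded in context"}`)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("score=%.1f reason=%q\n", v.Score, v.Reason)
}
```

Returning an error for malformed replies, rather than a zero score, keeps a flaky model response from being recorded as a genuine metric failure.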

CLI

The optional goeval CLI wraps common test, profile, compare, and summary workflows.

goeval test

Run a named goeval.json profile with GOEVAL=1, tier filters, result directories, and prerequisites applied.

goeval test --profile pr

goeval compare

Compare baseline and current JSONL results with policy tolerances, case IDs, and regression rules.

goeval compare --policy goeval.json old/results.jsonl new/results.jsonl

goeval summarize

Summarize pass rates, p95 latency/tokens, metadata groups, scenario totals, and flaky identities.

goeval summarize --policy goeval.json current/results.jsonl
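
For reference, a p95 figure like the one in these summaries can be computed with the nearest-rank method shown below. This is a sketch of one common definition; the source does not say which percentile method goeval summarize uses, and interpolating variants give slightly different numbers.

```go
package main

import (
	"fmt"
	"sort"
)

// p95 returns the 95th percentile of values using the nearest-rank method.
// The input is copied before sorting, and an empty input yields 0.
func p95(values []float64) float64 {
	if len(values) == 0 {
		return 0
	}
	sorted := append([]float64(nil), values...)
	sort.Float64s(sorted)
	// rank = ceil(0.95 * n), computed with integers to avoid rounding issues.
	rank := (95*len(sorted) + 99) / 100
	return sorted[rank-1]
}

func main() {
	// Hypothetical judge latencies in milliseconds.
	latencies := []float64{120, 95, 300, 110, 105, 98, 101, 130, 2500, 115}
	fmt.Println("p95 latency (ms):", p95(latencies))
}
```

With ten samples the nearest rank is the tenth value, so a single slow call dominates p95; larger runs give a more stable figure.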

goeval report

Render static HTML, Markdown, or JSON reports from JSONL result files.

goeval report current/results.jsonl --out report.html

goeval calibrate

Analyze judge disagreement, aggregate duplicate rows, and compare A/B variants.

goeval calibrate --judge-key judge current/results.jsonl

goeval version

Print CLI version information.

goeval version

Core Concepts

  • Case - Input, output, expected value, context, artifacts, turns, traces, expected tool calls, metadata, and timeout.
  • Scenario - Ordered multi-step agent flow with history, artifacts, state, tools, and repeats.
  • Contract - A named group of checks reported as one business-level pass/fail result.
  • Artifacts - Named structured JSON outputs for deterministic workflow checks.
  • Trajectory - Typed turns and tool calls for agent path evaluation.
  • Trace - Structured agent execution with spans, tool calls, artifact records, and state deltas.
  • Metric - A stateless scoring function with thresholded pass/fail behavior.
  • Precheck - Conditional wrapper that gates expensive metrics behind cheap checks.
  • Repeat - Wrapper for repeated runs, pass-rate aggregation, and score variance.
  • Eval Profiles - Named goeval.json run shapes for packages, tiers, results, and prerequisites.
  • Prerequisite Checks - Env, file, TCP, or custom checks that can skip or fail a profile before go test runs.
  • Compare Policies - Policies for score tolerance, stable identity, and regression behavior.
  • Reliability Summaries - Pass rates, p95 latency/tokens, scenario totals, metadata groups, and flaky identities.
  • Reports - Static HTML, Markdown, or JSON evaluation reports from JSONL result files.
  • Calibration - Judge disagreement analysis and A/B variant comparison for eval reliability.
  • Scenario Datasets - Portable JSON scenario definitions with named drivers bound in Go.
  • Stable Case IDs - Case metadata identities that survive test renames across result comparisons.
  • TierFilter - GOEVAL_TIER-driven case slicing when DefaultTierFilter is installed.
  • Normalizer - String comparison hook for deterministic checks where case or accents vary.
  • Judge - Concurrency-safe LLM-as-judge implementation returning scores and reasons.
  • Runner - Executes cases with metrics, handles GOEVAL gating, assertions, and result sinks.
  • CaseMetadata - Standard keys such as flow, tier, and dataset for filtering and reports.
  • MockJudge - Scripted judge for tests that should not call an LLM.