Agentic Evaluation Testing

Green Goods still uses agentic evaluation, but the runnable benchmark-pack layer is intentionally small. The current harness combines product acceptance cases, guidance-consistency checks, targeted test loops, and human review instead of maintaining separate model benchmark packs for retired agents.

What It Checks

Active Eval Surfaces

Surface | Purpose | Status
--- | --- | ---
.claude/evals/acceptance/ | Product acceptance cases and workflow-level checks | Live
node .claude/scripts/check-guidance-consistency.js | Guidance drift detection across committed agent surfaces | Live
.github/workflows/claude-agent-evals.yml | CI sync check for the committed agent/eval surface | Live

The only live eval directory under .claude/evals/ today is acceptance/. Automated benchmark packs for triage, code-reviewer, oracle, and cracked-coder are retired, and the current CI workflow checks that this repo stays aligned with that retirement state. The committed agent surface is currently cracked-coder plus oracle; specialization otherwise routes through skills and plan-hub lanes.

Model Selection

Model choice still matters for judgment-heavy work (a small routing sketch follows the list):

  • Opus -- suited to implementation, review, and architecture judgment
  • Sonnet -- suited to straightforward lookups and mechanical transforms
  • Haiku -- keep for trivial routing or small deterministic work, not review
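
This guidance can be captured programmatically. A minimal sketch, assuming a hypothetical chooseModel helper and task-kind labels that are not part of this repo:

```js
// Hypothetical routing helper -- not repo code. The task kinds and the
// chooseModel name are assumptions used only to illustrate the policy above.
const MODEL_BY_TASK = {
  implementation: "opus",
  review: "opus",
  architecture: "opus",
  lookup: "sonnet",
  transform: "sonnet",
  routing: "haiku",
};

function chooseModel(taskKind) {
  // Unclassified work defaults to the strongest model rather than
  // under-provisioning judgment-heavy tasks.
  return MODEL_BY_TASK[taskKind] ?? "opus";
}

console.log(chooseModel("review")); // -> "opus"
```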

Evaluation Criteria

Change Quality

Agent-assisted changes are evaluated against these criteria (a sketch of a grounded, actionable finding follows the list):

  1. Grounding -- every finding or fix should point to specific files, tests, or plan artifacts
  2. False positive rate -- findings should be rare, specific, and anchored in current repo surfaces rather than retired workflows
  3. Actionability -- findings must suggest a concrete next step, not just a label
  4. Context awareness -- the agent must read surrounding code, current docs, and feature-hub state before claiming drift or failure
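
To make criteria 1 and 3 concrete, here is a hypothetical finding record; the field names and paths are invented for illustration, not a committed schema:

```js
// Hypothetical finding record -- illustrates "grounded" and "actionable".
// Field names and file paths are invented for this example.
const finding = {
  claim: "saveRecord drops the submitted timestamp on retry",
  grounding: {
    files: ["src/example/saveRecord.ts"],
    tests: ["test/example/saveRecord.test.ts"],
  },
  nextStep: "Carry the original timestamp through the retry queue.",
};

// A finding with no grounding or no next step fails criteria 1 and 3 outright.
const acceptable =
  finding.grounding.files.length > 0 && finding.nextStep.length > 0;
console.log(acceptable); // -> true
```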

Product Acceptance Quality

When feature work needs user-level verification, consult the acceptance cases in .claude/evals/acceptance/ (a hypothetical case shape is sketched after this list):

  • user stories should map to concrete product behavior
  • acceptance cases complement code-level heuristics rather than replacing them
  • passing targeted tests does not guarantee the workflow matches product intent
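
The on-disk format of the acceptance pack is not specified here. As a sketch only, a case could pair a user story with observable behavior, assuming a shape like this:

```js
// Hypothetical acceptance case -- the real files under
// .claude/evals/acceptance/ may use a different format entirely.
const acceptanceCase = {
  story: "As a user, I can complete the workflow end to end",
  observableBehavior: [
    "each step of the workflow is reachable from the previous one",
    "the final state matches what the story promises",
  ],
  // Satisfied only when the end-to-end workflow matches product intent;
  // passing targeted tests alone does not close this case.
  complements: ["targeted unit tests", "code-level heuristics"],
};
```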

How It's Configured

The Three-Strike Protocol

When an agent fails to fix an issue, escalate one strike per failed attempt:

  1. Strike 1 -- Reassess assumptions. Is the test failing for the right reason?
  2. Strike 2 -- Question the architecture. Is there a fundamentally different approach?
  3. Strike 3 -- Stop and escalate. Document what was tried and what the agent's hypothesis was.

This prevents agents from burning context window on unproductive loops.
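
A minimal sketch of the protocol as an attempt counter; the prompts and the nextAction helper are illustrative, not repo API:

```js
// Three-strike escalation, assuming at least one failed attempt has occurred.
const STRIKES = [
  "Reassess assumptions: is the test failing for the right reason?",
  "Question the architecture: is there a fundamentally different approach?",
  "Stop and escalate: document what was tried and the current hypothesis.",
];

function nextAction(failedAttempts) {
  // Attempts beyond the third stay at "stop and escalate" -- the loop ends.
  const strike = Math.min(Math.max(failedAttempts, 1), STRIKES.length);
  return `Strike ${strike}: ${STRIKES[strike - 1]}`;
}

console.log(nextAction(1)); // Strike 1: Reassess assumptions...
console.log(nextAction(4)); // Strike 3: Stop and escalate...
```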

Guidance Consistency

The check-guidance-consistency.js script validates that agent instructions across CLAUDE.md, AGENTS.md, .claude/agents/, and .claude/rules/ do not contradict each other:

node .claude/scripts/check-guidance-consistency.js

This runs in CI via .github/workflows/claude-guidance.yml to catch drift between guidance files.
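
The script's internals are not reproduced here, but a drift check of this kind can be sketched as plain file scans over the same surfaces; the matching rule shown is an invented example, and the real script may differ:

```js
// Sketch of a guidance-drift check over the committed guidance surfaces.
const { readFileSync, readdirSync } = require("node:fs");
const { join } = require("node:path");

const surfaces = ["CLAUDE.md", "AGENTS.md"];
for (const dir of [".claude/agents", ".claude/rules"]) {
  for (const file of readdirSync(dir)) surfaces.push(join(dir, file));
}

// Example rule: retired agents must not be referenced as live anywhere.
const RETIRED = ["triage", "code-reviewer"];
let drift = 0;
for (const file of surfaces) {
  const text = readFileSync(file, "utf8");
  for (const agent of RETIRED) {
    if (text.includes(`agent: ${agent}`)) {
      console.error(`${file}: references retired agent "${agent}"`);
      drift += 1;
    }
  }
}
process.exit(drift > 0 ? 1 : 0);
```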

Inner-Loop Policy

For iterative agent work, use the fastest honest loop:

  • targeted bun run test -- <file> while shaping a change
  • bash scripts/check-test-quality.sh when touching test governance
  • broader package or repo gates only once the local loop is green

Coverage remains a scheduled floor on package CI and pre-merge validation, not the per-change inner loop.

Running & Troubleshooting

Eval Surface Sync

The claude-agent-evals.yml workflow no longer runs model benchmark packs. It now verifies that:

  1. the committed .claude/agents/ surface matches the documented current agents
  2. .claude/evals/ only contains the live acceptance pack

Diff-scoped automated review still happens through the dedicated review workflow and human validation, not through retired benchmark suites.
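
Both assertions reduce to filesystem checks. A minimal sketch, assuming plain directory listings; the actual workflow may implement them differently:

```js
// Sketch of the two sync assertions from the list above.
const { readdirSync } = require("node:fs");

// 1. Committed agent surface matches the documented current agents.
const DOCUMENTED_AGENTS = ["cracked-coder", "oracle"].sort();
const committed = readdirSync(".claude/agents")
  .filter((name) => name.endsWith(".md"))
  .map((name) => name.replace(/\.md$/, ""))
  .sort();
const agentsInSync =
  JSON.stringify(committed) === JSON.stringify(DOCUMENTED_AGENTS);

// 2. .claude/evals/ only contains the live acceptance pack.
const evalEntries = readdirSync(".claude/evals");
const evalsInSync = evalEntries.length === 1 && evalEntries[0] === "acceptance";

process.exit(agentsInSync && evalsInSync ? 0 : 1);
```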

Lessons Learned

  • Repo-truth drift is more dangerous than missing a benchmark pack. Keep docs, workflows, and committed agent surfaces aligned.
  • Targeted loops beat blanket coverage during active implementation. Use broader coverage only once the scoped loop is already green.
  • Acceptance cases are a useful backstop for product intent, especially when code-level checks pass but the user workflow still feels off.
  • Context window management matters. Long sessions can checkpoint to session-state.md and tests.json, but .plans/ remains the durable repo truth.

Resources

  • Husky Git Hooks -- Local quality gates that run before code reaches the repository
  • Regression Testing -- Regression suites that agents help maintain
  • GitHub Actions -- CI pipeline including the eval surface sync
  • Test Cases -- Test case strategy that agents follow during TDD
  • Agent specs: .claude/agents/*.md
  • Guidance consistency script: .claude/scripts/check-guidance-consistency.js

Next best action

See how git hooks enforce code quality gates before code reaches the repository.

Husky Git Hooks