Agentic Evaluation Testing
Green Goods still uses agentic evaluation, but the runnable benchmark-pack layer is intentionally small. The current harness combines product acceptance cases, guidance-consistency checks, targeted test loops, and human review instead of maintaining separate model benchmark packs for retired agents.
What It Checks
Active Eval Surfaces
| Surface | Purpose | Status |
|---|---|---|
| .claude/evals/acceptance/ | Product acceptance cases and workflow-level checks | Live |
| node .claude/scripts/check-guidance-consistency.js | Guidance drift detection across committed agent surfaces | Live |
| .github/workflows/claude-agent-evals.yml | CI sync check for the committed agent/eval surface | Live |
The only live eval directory under .claude/evals/ today is acceptance/. Automated benchmark packs for triage, code-reviewer, oracle, and cracked-coder are retired, and the current CI workflow checks that this repo stays aligned with that retirement state. The committed agent surface is currently cracked-coder plus oracle; specialization otherwise routes through skills and plan-hub lanes.
Model Selection
Model choice still matters for judgment-heavy work:
- Opus -- suited to implementation, review, and architecture judgment
- Sonnet -- suited to straightforward lookups and mechanical transforms
- Haiku -- keep for trivial routing or small deterministic work, not review
Evaluation Criteria
Change Quality
Agent-assisted changes are evaluated against these criteria:
- Grounding -- every finding or fix should point to specific files, tests, or plan artifacts
- False positive rate -- findings should be rare, specific, and anchored in current repo surfaces rather than retired workflows
- Actionability -- findings must suggest a concrete next step, not just a label
- Context awareness -- the agent must read surrounding code, current docs, and feature-hub state before claiming drift or failure
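Taken together, these criteria imply a concrete shape for a finding. The sketch below is hypothetical JavaScript: the field names and the file path are illustrative only, not a schema committed anywhere in this repo.

```js
// Hypothetical shape of a single agent finding that satisfies the
// criteria above. Nothing here is a committed schema.
const finding = {
  summary: "garden assessment hook refetches on every render",
  // Grounding: point at specific files/tests, not vague areas.
  evidence: ["packages/client/src/hooks/useAssessments.ts:42"],
  // Actionability: a concrete next step, not just a label.
  nextStep: "Stabilize the query key so the hook refetches only when gardenId changes",
  // Context awareness: record what was read before claiming drift.
  contextChecked: ["surrounding hook code", "current docs", "feature-hub state"],
};

module.exports = finding;
```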
Product Acceptance Quality
When feature work needs user-level verification, consult the acceptance cases in .claude/evals/acceptance/:
- user stories should map to concrete product behavior
- acceptance cases complement code-level heuristics rather than replacing them
- passing targeted tests does not guarantee the workflow matches product intent
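The exact on-disk format of those cases is not prescribed on this page. As a hedged illustration, assuming a case were captured as structured data, it might look like the following; the story, steps, and field names are all hypothetical:

```js
// Hypothetical acceptance case mapping a user story to observable
// product behavior. The real files under .claude/evals/acceptance/
// may use a different format entirely.
module.exports = {
  story: "As a gardener, I can submit a garden assessment with photos",
  steps: [
    "open the assessment form for an existing garden",
    "attach at least one photo and fill the required fields",
    "submit and return to the garden detail view",
  ],
  expected: "the new assessment appears in the garden's history",
  // Complements code-level checks rather than replacing them: a green
  // unit test for the mutation does not prove this workflow reads right.
  relatedTests: ["assessment submission mutation"],
};
```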
How It's Configured
The Three-Strike Protocol
If an agent repeatedly fails to fix an issue, escalate one strike per failed attempt:
- Strike 1 -- Reassess assumptions. Is the test failing for the right reason?
- Strike 2 -- Question the architecture. Is there a fundamentally different approach?
- Strike 3 -- Stop and escalate. Document what was tried and what the agent's hypothesis was.
This prevents agents from burning context window on unproductive loops.
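A minimal sketch of the protocol as a driver loop, assuming hypothetical helpers for each strike action; none of these functions exist in the repo:

```js
// Hypothetical driver enforcing the three-strike protocol.
// Every helper below is an illustrative stub.
async function attemptFix(issue) { /* run the agent's fix attempt */ return false; }
async function reassessAssumptions(issue) { /* strike 1: failing for the right reason? */ }
async function questionArchitecture(issue) { /* strike 2: fundamentally different approach? */ }
function escalate(issue, record) { console.error("escalating:", issue.id, record); }

async function fixWithStrikes(issue) {
  if (await attemptFix(issue)) return;  // attempt 1
  await reassessAssumptions(issue);     // strike 1
  if (await attemptFix(issue)) return;  // attempt 2
  await questionArchitecture(issue);    // strike 2
  if (await attemptFix(issue)) return;  // attempt 3
  // Strike 3: stop, document what was tried and the working hypothesis.
  escalate(issue, { attempts: 3, hypothesis: issue.hypothesis });
}

module.exports = { fixWithStrikes };
```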
Guidance Consistency
The check-guidance-consistency.js script validates that agent instructions across CLAUDE.md, AGENTS.md, .claude/agents/, and .claude/rules/ do not contradict each other:
node .claude/scripts/check-guidance-consistency.js
This runs in CI via .github/workflows/claude-guidance.yml to catch drift between guidance files.
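The script's internals are not documented on this page. One plausible shape for this kind of drift check, assuming a naive scan for references to retired agents, is sketched below; the real check-guidance-consistency.js may be structured very differently:

```js
// Hypothetical sketch of a guidance drift check across the committed
// guidance surfaces. Not the actual implementation.
const fs = require("node:fs");
const path = require("node:path");

function listMarkdown(dir) {
  if (!fs.existsSync(dir)) return [];
  return fs.readdirSync(dir)
    .filter((f) => f.endsWith(".md"))
    .map((f) => path.join(dir, f));
}

const surfaces = ["CLAUDE.md", "AGENTS.md"]
  .concat(listMarkdown(".claude/agents"), listMarkdown(".claude/rules"))
  .filter((f) => fs.existsSync(f));

// Naive contradiction check: flag surfaces that still reference
// agents the docs describe as retired.
const retired = ["triage", "code-reviewer"];
for (const file of surfaces) {
  const text = fs.readFileSync(file, "utf8");
  for (const name of retired) {
    if (text.includes(name)) {
      console.error(`${file}: references retired agent "${name}"`);
      process.exitCode = 1;
    }
  }
}
```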
Inner-Loop Policy
For iterative agent work, use the fastest honest loop:
- targeted bun run test -- <file> while shaping a change
- bash scripts/check-test-quality.sh when touching test governance
- broader package or repo gates only once the local loop is green
Coverage remains a scheduled floor on package CI and pre-merge validation, not the per-change inner loop.
Running & Troubleshooting
Eval Surface Sync
The claude-agent-evals.yml workflow no longer runs model benchmark packs. It now verifies that:
- the committed .claude/agents/ surface matches the documented current agents
- .claude/evals/ only contains the live acceptance pack
Diff-scoped automated review still happens through the dedicated review workflow and human validation, not through retired benchmark suites.
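Assuming the sync check reduces to a set comparison against the documented state, a hedged sketch might look like this; the agent spec filenames are inferred from the agent names above and are not confirmed by this page:

```js
// Hypothetical sync check mirroring what claude-agent-evals.yml
// verifies; the real workflow step may implement this differently.
const fs = require("node:fs");

// Documented current state: two committed agents, one live eval pack.
const expectedAgents = ["cracked-coder.md", "oracle.md"]; // assumed filenames
const expectedEvals = ["acceptance"];

function assertExactly(dir, expected) {
  const actual = fs.readdirSync(dir).sort();
  const want = [...expected].sort();
  if (JSON.stringify(actual) !== JSON.stringify(want)) {
    console.error(`${dir} drifted: have [${actual}], want [${want}]`);
    process.exitCode = 1;
  }
}

assertExactly(".claude/agents", expectedAgents);
assertExactly(".claude/evals", expectedEvals);
```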
Lessons Learned
- Repo-truth drift is more dangerous than missing a benchmark pack. Keep docs, workflows, and committed agent surfaces aligned.
- Targeted loops beat blanket coverage during active implementation. Use broader coverage only once the scoped loop is already green.
- Acceptance cases are a useful backstop for product intent, especially when code-level checks pass but the user workflow still feels off.
- Context window management matters. Long sessions can checkpoint to session-state.md and tests.json, but .plans/ remains the durable repo truth.
Resources
- Husky Git Hooks -- Local quality gates that run before code reaches the repository
- Regression Testing -- Regression suites that agents help maintain
- GitHub Actions -- CI pipeline including the eval surface sync
- Test Cases -- Test case strategy that agents follow during TDD
- Agent specs: .claude/agents/*.md
- Guidance consistency script: .claude/scripts/check-guidance-consistency.js
Next best action
See how git hooks enforce code quality gates before code reaches the repository: Husky Git Hooks.