Agentic Evaluation Testing

Green Goods still uses agentic evaluation, but the runnable benchmark-pack layer is intentionally small. The current harness combines product acceptance cases, guidance-consistency checks, targeted test loops, and human review instead of maintaining separate per-agent model benchmark packs, which are now retired.

What It Checks

Active Eval Surfaces

Surface	Purpose	Status
`.claude/evals/acceptance/`	Product acceptance cases and workflow-level checks	Live
`node .claude/scripts/check-guidance-consistency.js`	Guidance drift detection across committed agent surfaces	Live
`.github/workflows/claude-agent-evals.yml`	CI sync check for the committed agent/eval surface	Live

The only live eval directory under .claude/evals/ today is acceptance/. Automated benchmark packs for triage, code-reviewer, oracle, and cracked-coder are retired, and the current CI workflow checks that this repo stays aligned with that retirement state. The committed agent surface is currently cracked-coder plus oracle; specialization otherwise routes through skills and plan-hub lanes.

Model Selection

Model choice still matters for judgment-heavy work:

Opus -- suited to implementation, review, and architecture judgment
Sonnet -- suited to straightforward lookups and mechanical transforms
Haiku -- keep for trivial routing or small deterministic work, not review
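
As an illustration only, the tiers above could be encoded as a small lookup. The task-kind names and the fallback to Opus for unknown kinds are assumptions made for this sketch, not part of the repo's configuration.

```javascript
// Hypothetical task kinds mapped to the tiers described above.
const MODEL_BY_TASK = new Map([
  ["implementation", "opus"],
  ["review", "opus"],
  ["architecture", "opus"],
  ["lookup", "sonnet"],
  ["mechanical-transform", "sonnet"],
  ["routing", "haiku"],
]);

// Assumption: unknown task kinds fall back to the strongest tier so that
// judgment-heavy work is never silently routed to a small model.
function pickModel(taskKind) {
  return MODEL_BY_TASK.get(taskKind) ?? "opus";
}

console.log(pickModel("review")); // opus
console.log(pickModel("routing")); // haiku
```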

Evaluation Criteria

Change Quality

Agent-assisted changes are evaluated against these criteria:

Grounding -- every finding or fix should point to specific files, tests, or plan artifacts
False positive rate -- false positives should be rare; findings should be specific and anchored in current repo surfaces rather than retired workflows
Actionability -- findings must suggest a concrete next step, not just a label
Context awareness -- the agent must read surrounding code, current docs, and feature-hub state before claiming drift or failure
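
A minimal sketch of how the grounding and actionability criteria might be checked mechanically. The finding shape (`summary`, `evidence`, `nextStep`) is hypothetical and is not taken from the repo. False positive rate and context awareness still need human judgment and are not covered here.

```javascript
// Hypothetical finding shape: { summary: string, evidence: string[], nextStep: string }
function auditFinding(finding) {
  const problems = [];
  const evidence = finding.evidence ?? [];

  // Grounding: at least one file, test, or plan artifact must be cited.
  if (evidence.length === 0) {
    problems.push("grounding: no file, test, or plan artifact cited");
  }

  // Actionability: a label alone is not enough; a next step is required.
  if (!finding.nextStep || finding.nextStep.trim() === "") {
    problems.push("actionability: no concrete next step");
  }

  return problems;
}

// Example with an ungrounded, label-only finding.
console.log(auditFinding({ summary: "Looks flaky", evidence: [], nextStep: "" }));
```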

Product Acceptance Quality

When feature work needs user-level verification, consult the acceptance cases in .claude/evals/acceptance/:

user stories should map to concrete product behavior
acceptance cases complement code-level heuristics rather than replacing them
passing targeted tests does not guarantee the workflow matches product intent

How It's Configured

The Three-Strike Protocol

Each failed attempt to fix an issue counts as a strike, and each strike triggers a different response:

Strike 1 -- Reassess assumptions. Is the test failing for the right reason?
Strike 2 -- Question the architecture. Is there a fundamentally different approach?
Strike 3 -- Stop and escalate. Document what was tried and what the agent's hypothesis was.

This prevents agents from burning context window on unproductive loops.
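
The protocol can be sketched as a bounded retry loop, assuming each failed attempt counts as one strike. The `attemptFix` callback and its return shape are hypothetical and used only for illustration.

```javascript
// The response attached to each strike, in order.
const STRIKE_ACTIONS = [
  "reassess assumptions",
  "question the architecture",
  "stop and escalate",
];

// `attemptFix(strike)` is a hypothetical callback returning
// { fixed: boolean, note: string } for the given attempt number.
function runWithThreeStrikes(attemptFix) {
  const log = [];
  for (let strike = 1; strike <= 3; strike++) {
    const result = attemptFix(strike);
    if (result.fixed) {
      return { status: "fixed", attempts: strike, log };
    }
    // Record what was tried so the escalation has a written trail.
    log.push({ strike, action: STRIKE_ACTIONS[strike - 1], note: result.note });
  }
  return { status: "escalated", attempts: 3, log };
}

// Example: the fix lands on the second attempt.
const outcome = runWithThreeStrikes((n) => ({ fixed: n === 2, note: `attempt ${n}` }));
console.log(outcome.status, outcome.attempts); // fixed 2
```

The hard cap at three is the point of the design: the loop cannot continue past the escalation step, and the log carries what was tried.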

Guidance Consistency

The check-guidance-consistency.js script validates that agent instructions across CLAUDE.md, AGENTS.md, .claude/agents/, and .claude/rules/ do not contradict each other:

node .claude/scripts/check-guidance-consistency.js

This runs in CI via .github/workflows/claude-guidance.yml to catch drift between guidance files.
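
The internals of the script are not described here, so the following is only an illustration of one kind of drift such a check could catch: guidance text that references an agent spec that is no longer committed. The function name, the input shape, and the regular expression are all assumptions for this sketch.

```javascript
// Hypothetical input: { "<guidance file>": "<file text>" } plus the list of
// committed agent names. Not the real script's logic.
function findUnknownAgentRefs(guidanceFiles, committedAgents) {
  const known = new Set(committedAgents);
  const issues = [];
  for (const [file, text] of Object.entries(guidanceFiles)) {
    // Look for references of the form .claude/agents/<name>.md
    for (const match of text.matchAll(/\.claude\/agents\/([\w-]+)\.md/g)) {
      if (!known.has(match[1])) {
        issues.push({ file, agent: match[1] });
      }
    }
  }
  return issues;
}

const issues = findUnknownAgentRefs(
  { "CLAUDE.md": "See .claude/agents/oracle.md and .claude/agents/triage.md" },
  ["cracked-coder", "oracle"]
);
console.log(issues); // one issue, for the retired triage agent
```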

Inner-Loop Policy

For iterative agent work, use the fastest honest loop:

targeted bun run test -- <file> while shaping a change
bash scripts/check-test-quality.sh when touching test governance
broader package or repo gates only once the local loop is green

Coverage remains a floor enforced by scheduled package CI and pre-merge validation, not part of the per-change inner loop.
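
The inner-loop policy above can be sketched as a small command planner. The function name, its input shape, and the example test path are hypothetical; only the two commands come from the policy itself.

```javascript
// Builds the inner-loop command list for one change.
// `testFiles` and `touchesTestGovernance` are supplied by the caller;
// how they are detected is left out of this sketch.
function innerLoopCommands({ testFiles, touchesTestGovernance = false }) {
  // One targeted run per test file while the change is being shaped.
  const commands = testFiles.map((file) => `bun run test -- ${file}`);

  // The test-quality check is only added when test governance is touched.
  if (touchesTestGovernance) {
    commands.push("bash scripts/check-test-quality.sh");
  }

  // Broader package or repo gates are deliberately absent: they run only
  // after this list is green.
  return commands;
}

console.log(innerLoopCommands({ testFiles: ["src/example.test.ts"] }));
```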

Running & Troubleshooting

Eval Surface Sync

The claude-agent-evals.yml workflow no longer runs model benchmark packs. It now verifies that:

the committed .claude/agents/ surface matches the documented current agents
.claude/evals/ only contains the live acceptance pack

Diff-scoped automated review still happens through the dedicated review workflow and human validation, not through retired benchmark suites.
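
A minimal sketch of the two sync conditions listed above, written as a pure function over directory listings. The function name and input shape are assumptions, and the actual workflow may implement the check differently.

```javascript
// Hypothetical inputs:
//   agentFiles       - file names found in .claude/agents/
//   evalEntries      - entry names found in .claude/evals/
//   documentedAgents - agent names the docs list as current
function checkEvalSurface({ agentFiles, evalEntries, documentedAgents }) {
  const errors = [];

  // Condition 1: committed agents match the documented agents.
  const committed = agentFiles.map((f) => f.replace(/\.md$/, "")).sort();
  const documented = [...documentedAgents].sort();
  if (committed.join(",") !== documented.join(",")) {
    errors.push(`agents mismatch: committed [${committed}] vs documented [${documented}]`);
  }

  // Condition 2: the eval directory holds only the live acceptance pack.
  const extras = evalEntries.filter((entry) => entry !== "acceptance");
  if (extras.length > 0) {
    errors.push(`unexpected eval entries: ${extras.join(", ")}`);
  }
  if (!evalEntries.includes("acceptance")) {
    errors.push("missing acceptance pack");
  }

  return errors;
}

console.log(
  checkEvalSurface({
    agentFiles: ["cracked-coder.md", "oracle.md"],
    evalEntries: ["acceptance"],
    documentedAgents: ["oracle", "cracked-coder"],
  })
); // []
```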

Lessons Learned

Repo-truth drift is more dangerous than missing a benchmark pack. Keep docs, workflows, and committed agent surfaces aligned.
Targeted loops beat blanket coverage during active implementation. Use broader coverage only once the scoped loop is already green.
Acceptance cases are a useful backstop for product intent, especially when code-level checks pass but the user workflow still feels off.
Context window management matters. Long sessions can checkpoint to session-state.md and tests.json, but .plans/ remains the durable repo truth.

Resources

Husky Git Hooks -- Local quality gates that run before code reaches the repository
Regression Testing -- Regression suites that agents help maintain
GitHub Actions -- CI pipeline including the eval surface sync
Test Cases -- Test case strategy that agents follow during TDD
Agent specs: .claude/agents/*.md
Guidance consistency script: .claude/scripts/check-guidance-consistency.js

Next best action

See how git hooks enforce code quality gates before code reaches the repository.

Husky Git Hooks
