AI-Generated Eval Harness

The AI eval harness feature lets you describe what you want to achieve in plain English — and retrAI writes the test harness for you.

The Problem It Solves

Sometimes you want the agent to improve code but there's no pre-existing test suite. For example:

  • "Optimize this SQL query to run in < 50ms"
  • "Make the sorting function handle duplicates"
  • "Ensure the API returns a JSON data wrapper"

These aren't failures in an existing test suite; they're new requirements. With generate-eval, you describe the requirement and get a pytest harness that captures it.


Two-Phase Architecture

Phase 1: Planning Agent
  Input: natural language description + project files
  Output: .retrai/eval_harness.py

Phase 2: Implementation Agent
  Input: eval harness + project source
  Goal: make pytest .retrai/eval_harness.py pass

Phase 1 — Planning Agent

retrai generate-eval calls the LLM once with:

  • Your description
  • The project file listing
  • Key config files (pyproject.toml, package.json, etc.)
  • Up to 5 main source files

The LLM writes a complete, runnable pytest file that tests your requirement.
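
For the bubble_sort request shown under Usage below, the generated file might look roughly like this. Treat it as illustrative only: the actual tests, imports, and structure depend on the model's output, and sorting.py / bubble_sort come from that usage example, not from your project.

# Illustrative only — a real generated harness will vary with the model.
import sys
from pathlib import Path

# Make the project root importable from inside .retrai/
sys.path.insert(0, str(Path(__file__).resolve().parent.parent))

from sorting import bubble_sort


def test_empty_list():
    assert bubble_sort([]) == []


def test_normal_input_still_sorts():
    assert bubble_sort([3, 1, 2]) == [1, 2, 3]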

Phase 2 — Implementation Agent

retrai run ai-eval runs the standard agent loop, but:

  • AiEvalGoal.check() runs pytest .retrai/eval_harness.py
  • AiEvalGoal.system_prompt() includes your original description so the agent understands why
  • The agent reads the harness, understands what it tests, then fixes the source code
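
A minimal sketch of how such a goal can be wired, assuming a check()/system_prompt() interface like the one named above. The class body, subprocess handling, and config loading here are assumptions, not retrAI's actual implementation:

# Hypothetical sketch — retrAI's real AiEvalGoal will differ in detail.
import json
import subprocess

HARNESS = ".retrai/eval_harness.py"
CONFIG = ".retrai/ai_eval_config.json"


class AiEvalGoal:
    def __init__(self):
        # Load the original description saved by `retrai generate-eval`
        with open(CONFIG) as f:
            self.description = json.load(f)["description"]

    def check(self) -> bool:
        # The goal is met when the generated harness passes
        result = subprocess.run(["pytest", HARNESS], capture_output=True)
        return result.returncode == 0

    def system_prompt(self) -> str:
        # Include the user's intent so the agent knows why, and warn it
        # not to edit the harness itself
        return (
            f"Your goal: {self.description}\n"
            f"Make pytest {HARNESS} pass by editing the project source. "
            f"Do not modify {HARNESS}."
        )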


Usage

# Step 1: Generate the harness
retrai generate-eval "make the bubble_sort function in sorting.py handle empty lists"

# This creates .retrai/eval_harness.py and prints it for review.

# Step 2: Run the agent
retrai run ai-eval

# Use a different model
retrai generate-eval "optimize DB query" --model claude-opus-4-6
retrai run ai-eval --model claude-opus-4-6 --max-iter 30

What Gets Created

.retrai/
├── eval_harness.py      The generated pytest test file
└── ai_eval_config.json  Metadata: description, model, timestamp

Example .retrai/ai_eval_config.json:

{
  "description": "make bubble_sort handle empty lists",
  "harness_file": ".retrai/eval_harness.py",
  "generated_at": "2026-02-17T12:00:00+00:00",
  "model": "claude-sonnet-4-6"
}
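
Since the metadata is plain JSON, it is easy to inspect before a run. A hypothetical helper script (not part of retrAI) might look like:

# Hypothetical inspection script; field names match the example above.
import json

with open(".retrai/ai_eval_config.json") as f:
    cfg = json.load(f)

print("Harness:", cfg["harness_file"])
print("Goal:   ", cfg["description"])
print("Model:  ", cfg["model"], "generated at", cfg["generated_at"])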


Tips

  • Review the harness before running the agent — make sure it tests what you intended
  • Be specific in your description: "make sort() handle lists with None values" is better than "fix sort"
  • Iterative refinement: if the generated harness isn't right, just re-run generate-eval with a clearer description
  • Don't modify the harness while the agent is running — AiEvalGoal explicitly tells the agent not to touch it