⚡ How Warren Evals Work

A visual explanation of the evaluation system Victor & Charlie built with Warren — what it is, how it works, why WWTD alone isn't enough, and what's left to do.
🎯 The Problem We're Solving
Warren produces 100+ outputs per day. How do we know they're good?

Tony can't review everything. We need an automated system that catches quality regressions — not just "did it run?" but "did it make the right call?"

❌ Without Evals

  • Tony catches problems by accident
  • Bad outputs ship before anyone notices
  • Config changes break judgment silently
  • No way to measure improvement over time
  • "Trust me, it works" is the only evidence

✅ With Evals

  • Every prompt/config change is tested automatically
  • LLM judge scores outputs using Tony's own criteria
  • Regressions caught before they reach clients
  • Measurable agreement rate: judge vs. human
  • Data-driven evidence of agent quality
🔄 The Eval Loop — How It Works
🗂️ Capture → 📏 Encode → 🧪 Test → 📊 Measure → 🔁 Iterate
This loop runs continuously — every time we change Warren's prompts, SOPs, or config
Step 1 — Capture
Mine real human judgments
We extracted 68 of Tony's real decisions from session history — every time he said "this is wrong," "this is right," or taught Warren a principle. Each entry has: the agent output, Tony's verdict (PASS/FAIL/CORRECTION), and his reasoning.
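As a rough illustration, a single captured judgment could be stored like this (field names and values are illustrative placeholders, not the actual mining schema):

```yaml
# One mined judgment entry (illustrative sketch, not the real schema)
- id: judgment-042                  # hypothetical identifier
  agent_output: |
    Warren proposed building a full admin dashboard in response to the
    client's request for a weekly status report.
  verdict: FAIL                     # PASS | FAIL | CORRECTION
  reasoning: >
    The client asked for a one-page report, not a dashboard.
    Scope should match the stated need.
  source: session-history           # where the decision was mined from
  tags: [product-scope]
```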
Step 2 — Encode
Turn judgment into scoring rubrics
Tony's 36 "teaching" entries become the rubric criteria — the rules that define what PASS means. These get structured into YAML files that an LLM judge can use to score any new output. The rubrics ARE Tony's judgment, codified.
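A rubric file could look roughly like the sketch below; the criteria are paraphrased for illustration, and the file name and keys are assumptions rather than the finished format:

```yaml
# rubrics/product_scope.yaml (illustrative sketch; real criteria come from Tony's teaching entries)
domain: product-scope
verdicts: [PASS, FAIL]        # binary, no partial credit
criteria:
  - id: build-for-stated-need
    rule: "Scope matches what the customer actually asked for; no speculative extras."
  - id: effort-matches-value
    rule: "Effort is proportional to value; low-impact pieces are not gold-plated."
```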
Step 3 — Test
LLM judge scores against rubrics
A different model (not the one Warren uses) reads the rubric + the agent output and scores it. Cross-model judging avoids the agent "grading its own homework." Binary: PASS or FAIL, no middle ground.
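In PromptFoo terms (the tool named in the status section below), a test case with a cross-model judge could look something like this; the provider path, model name, and rubric text are placeholders:

```yaml
# promptfooconfig.yaml (sketch; provider paths and model names are placeholders)
providers:
  - file://providers/openclaw_provider.py   # custom provider wrapping the agent under test
defaultTest:
  options:
    provider: openai:gpt-4o                 # a different model does the grading
tests:
  - vars:
      task: "Client asks for a weekly status report"
    assert:
      - type: llm-rubric
        value: "PASS only if the proposed scope matches the stated need (product-scope rubric)."
```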
Step 4 — Measure
Compare LLM judge vs. human verdicts
We check: does the LLM judge agree with Tony's actual verdicts? Target: >90% agreement. First result: 100% on easy held-out examples — encouraging, but the sample is small and those cases were deliberately easy. We need to stress-test with harder cases.
Step 5 — Iterate
Where they disagree → refine the rubric
Every disagreement between judge and human reveals a rubric gap. Fix the rubric, re-test, repeat until agreement is stable. The 8 calibration scenarios we sent Tony are specifically designed to find these gaps.
🏗️ Three Layers of Evaluation
L1
Mechanical Execution
Did the agent use the right tools in the right order? Fully automated — programmatic asserts, no LLM needed (see the assertion sketch after the three layers).
Ex: "Did it commit within 10 min?" "Did it push the branch?"
L2
Process Judgment
Did it follow the right process? Semi-automated with lightweight rubrics — pipeline routing, labels, SOP dispatch.
Ex: "Did it route the issue to the right label?" "Did it follow the SOP?"
L3
Product Judgment
Is it building the right thing? This is the hard one. Requires SME calibration — Tony for product/sales, Charlie for pipeline/tech.
Ex: "Did it choose the right scope?" "Did it read the customer's real need?"
WWTD (What Would Tony Do) lives in Layer 3

WWTD captures Tony's product and business judgment — the hardest layer. But Layers 1 and 2 are equally critical. An agent that makes the right product call but breaks the build pipeline or misroutes an issue still fails. We need all three layers covered.
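To make the split concrete, the sketch below shows how the three layers might map onto different kinds of checks in a single eval test. It assumes PromptFoo-style assertions and that tool calls appear in the captured output; both are assumptions, not the finished design.

```yaml
# Sketch: one test, three layers of checks (assertion values are illustrative)
assert:
  # L1 mechanical: pure programmatic check, no LLM involved
  - type: javascript
    value: "output.includes('git push')"   # assumes the tool-call trace is in the captured output
  # L2 process: lightweight rubric on routing / SOP compliance
  - type: llm-rubric
    value: "The issue was routed to the correct label per the triage SOP."
  # L3 product judgment: full WWTD-derived rubric
  - type: llm-rubric
    value: "The chosen scope matches the customer's real need, per the product-scope rubric."
```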

🔭 Why Calibrate Beyond WWTD?

📖 WWTD Alone

  • Static document — a RAG knowledge base
  • Describes Tony's principles in prose
  • Warren reads it and tries to follow it
  • No way to verify if it's actually working
  • Only covers Tony's domains (product/sales)
  • Principles can conflict — no resolution rules

🧪 WWTD + Evals

  • WWTD = the rules; Evals = the enforcement
  • Automated testing proves compliance
  • Catches regressions on every change
  • Covers ALL domains (product + tech + process)
  • Multiple SMEs: Tony, Charlie, Victor
  • Calibration scenarios resolve principle conflicts
The analogy: WWTD is the law. Evals are the court system.

Having laws isn't enough — you need a system that consistently interprets and enforces them. WWTD tells Warren what good judgment looks like. Evals prove whether Warren's actual outputs match that judgment. And the calibration scenarios we sent Tony? Those are the hard cases that test where the law is ambiguous — where two principles point in different directions.

Beyond Tony — the full SME map:
🎯
Tony
Product scope, sales/BD, customer judgment, design bar
⚙️
Charlie
Pipeline, architecture, code quality, technical correctness
📋
Victor
Process, delivery, operational quality, team coordination
📍 Where We Are — Status & What's Left
Framework design & architecture doc [Done]
Full eval framework proposal written and committed to ops repo
Phase 1 mining — 68 Tony judgments extracted [Done]
19 FAILs, 7 corrections, 6 PASSes, 36 teaching directives from 17 memory files
Reply-audit hook (live behavioral monitor) [Done]
Audits every Warren message for anti-patterns: permission loops, false "done" claims, option menus
8 calibration scenarios sent to Tony [Done]
Each scenario pits two of Tony's principles against each other to find rubric gaps
Awaiting Tony's responses on 8 scenarios [Now]
Need: PASS/FAIL/CONDITIONAL + reasoning + what most people would get wrong. Voice notes OK.
Write PromptFoo rubrics (YAML) [Next]
Structured scoring rubrics for product-scope and sales/BD domains
Build custom PromptFoo provider [Next]
Wraps OpenClaw as a target so PromptFoo can send test inputs and capture outputs
First agreement measurement round [Next]
LLM judge vs. Tony's actual verdicts — target >90% agreement
Wire into CI (GitHub Actions) [Future]
Evals run automatically on every PR that touches prompts, SOPs, or agent config (see the workflow sketch after this list)
Expand SME coverage (Charlie + Victor) [Future]
Mine Charlie's engineering judgments and Victor's process judgments into the corpus
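For the CI item above, a minimal GitHub Actions workflow could look like the sketch below; the path filters, secret name, and repo layout are assumptions:

```yaml
# .github/workflows/warren-evals.yml (sketch; paths and secret names are assumptions)
name: warren-evals
on:
  pull_request:
    paths:
      - "prompts/**"
      - "sops/**"
      - "config/**"
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run PromptFoo evals
        run: npx promptfoo@latest eval -c promptfooconfig.yaml
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
```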
⚖️ The 8 Calibration Scenarios — What We Sent Tony

Each scenario intentionally creates a conflict between two of Tony's principles. There's no "obvious" right answer — that's the point. Tony's verdict becomes ground truth.

Scenario 1
Scope vs. Relationship
Build unnecessary abstraction layer to protect client's identity concerns?
Tension: scope discipline ↔ relationship capital
Scenario 2
Effort-Value vs. Stated Need
Build what client asked for or what they actually need?
Tension: customer respect ↔ 80/20 efficiency
Scenario 3
Speed vs. Design Bar
Add features or polish visuals before a demo?
Tension: ship it ↔ it must look good
Scenario 4
Known Pattern vs. New Info
Follow the rule or adapt to the situation?
Tension: documented pattern ↔ context-aware judgment
Scenario 5
Truth vs. Perception
Present accurate analysis that embarrasses a key stakeholder?
Tension: truth first ↔ political awareness
Scenario 6
Scope vs. Reliability
Ship minimal webhook or add production-grade resilience?
Tension: build for need ↔ build for trust
Scenario 7
Show Don't Tell vs. Readiness
Demo when they explicitly asked for a document?
Tension: show don't tell ↔ customer preference
Scenario 8
Confidence Inversion
Safe competitive comparison or risky insight-driven plan?
Tension: expected deliverable ↔ trust the insight