Warren Evals — Agent Quality System
Evidence of the 3-level evaluation system that ensures Warren's outputs meet Tony-calibrated quality standards. All components are live and running on production work.
At a glance: 10/11 shadow-review pass rate · 8/8 judge agreement · 14/15 red-team catch rate · 86 corpus entries · 5 domains · 5 gates live in production.
The Problem
Warren produces 100+ outputs per day — BD dailies, sprint plans, product analyses, PR reviews, client deliverables. Tony can't review everything. We needed an automated quality system that catches judgment failures — not just "did it run?" but "did it make the right call?"
The Solution
A 3-level evaluation system built on Tony's actual judgment patterns. A cross-model AI judge (GLM 5.1) scores Warren's production outputs against rubrics extracted from 86 real Tony verdicts. The system runs both pre-delivery (a quality gate before outputs reach clients) and post-delivery (a weekly shadow review of all outputs).
Key Design Decisions
- Cross-model judging: GLM 5.1 (open-weights) judges Warren's Claude outputs — not self-eval. The agent never grades its own homework.
- Tony's real verdicts = ground truth: 86 entries mined from actual production evaluations, not synthetic test cases.
- Passthrough on error: API failures, timeouts, or missing keys never block delivery. Safety net, not a blocker.
- Domain-matched rubrics: Each output is evaluated against the correct domain rubric (sales-bd, product-scope, etc.), not a generic "is it good?" prompt.
- Binary verdicts: PASS or FAIL. No "mostly good" — this forces the rubrics to be precise. (A sketch of how these decisions compose follows this list.)
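A minimal sketch of how these decisions compose at the gate, assuming an OpenAI-compatible Together endpoint, a placeholder model id, and a hypothetical `rubrics/<domain>.yaml` layout; the production gate scripts may differ:

```python
import os
import pathlib

import requests  # assumption: production may use a different HTTP client
import yaml      # rubrics are YAML per this doc (pip install pyyaml)

RUBRIC_DIR = pathlib.Path("rubrics")  # hypothetical layout: rubrics/<domain>.yaml

def call_judge(rubric: str, output_text: str) -> str:
    """Cross-model judge call. Endpoint shape is OpenAI-compatible
    (Together-style); the model id is a placeholder, not the real one."""
    resp = requests.post(
        "https://api.together.xyz/v1/chat/completions",
        headers={"Authorization": f"Bearer {os.environ['TOGETHER_API_KEY']}"},
        json={
            "model": "glm-5.1",  # placeholder: substitute the actual GLM 5.1 id
            "messages": [{
                "role": "user",
                "content": (
                    "Judge the OUTPUT against the RUBRIC. First line must be "
                    "PASS or FAIL; then name the violated criterion.\n\n"
                    f"RUBRIC:\n{rubric}\n\nOUTPUT:\n{output_text}"
                ),
            }],
        },
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

def quality_gate(output_text: str, domain: str) -> dict:
    """Pre-delivery gate: domain-matched rubric, binary verdict, passthrough on error."""
    try:
        rubric = yaml.safe_load((RUBRIC_DIR / f"{domain}.yaml").read_text())
        detail = call_judge(yaml.safe_dump(rubric), output_text)
        verdict = "PASS" if detail.strip().upper().startswith("PASS") else "FAIL"
        return {"verdict": verdict, "detail": detail}
    except Exception as exc:
        # Passthrough on error: a broken judge never blocks delivery.
        return {"verdict": "PASS", "detail": f"judge unavailable ({exc}); passed through"}
```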
Each rubric encodes Tony's specific judgment patterns from real production evaluations. The rubric content IS Tony's criteria — extracted verbatim from his actual verdicts, structured into machine-parsable YAML.
Rubric domains: 🎯 Sales & BD · 📦 Product Scope · ⚖️ Effort-Value · ⚙️ Process · 🧠 Behavioral
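The real rubric files are internal; purely to illustrate the shape (every field name, id, and criterion below is hypothetical, not actual corpus content), a criterion might be encoded like this:

```python
import yaml  # pip install pyyaml

# Illustrative only: schema, ids, and criterion text are invented to show
# the shape of a rubric entry, not real corpus content.
RUBRIC_YAML = """
domain: sales-bd
criteria:
  - id: sbd-03
    rule: "Every active opportunity names a specific next action and a date."
    source_verdicts: [TV-2025-017]   # exact corpus entries backing the criterion
verdict_scale: [PASS, FAIL]          # binary by design; no partial credit
"""

rubric = yaml.safe_load(RUBRIC_YAML)
assert rubric["verdict_scale"] == ["PASS", "FAIL"]
```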
Shadow Review Results
| Domain | Total | Pass | Fail | Score |
|---|---|---|---|---|
| Behavioral | 1 | 1 | 0 | 100% |
| Effort-Value | 3 | 3 | 0 | 100% |
| Process | 2 | 1 | 1 | 50% |
| Product-Scope | 2 | 2 | 0 | 100% |
| Sales-BD | 3 | 3 | 0 | 100% |
| Overall | 11 | 10 | 1 | 91% |
The 1 Failure — Transparency Example
SR-008 (Process): PR #371 escalated to needs-human after 5 resolution-loop cycles without fixing CI. Judge ruled: should have presented clear options for human decision rather than continuing to loop. This is exactly the kind of judgment failure the system is designed to catch.
Sample Passes (What Good Looks Like)
- SR-001 (Sales-BD): BD Daily with 7 active opportunities, 2 new. Each opportunity had specific next actions and a timeline.
- SR-003 (Sales-BD): Synthesized Tony's WWTD audit + 8 Gemini transcripts + Trent email into cohesive sales strategy.
- SR-006 (Product-Scope): Admitted upfront what was NOT known: "I don't have access to Deloitte's cluster data" — honesty over fabrication.
- SR-009 (Effort-Value): 1-line href fix for a broken empty state nav — minimal execution plan for a minimal task.
Red-Team Suite
15 deliberately crafted bad outputs that should ALL fail. Each embodies a specific anti-pattern Tony has flagged in production. If the rubric misses any, it has a blind spot.
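A sketch of how the suite can be asserted, reusing the hypothetical `quality_gate` from the gate sketch above (the file name and entry fields are assumptions):

```python
import json
import pathlib

# Field names and the file path are assumptions. Note: in an offline red-team
# run you would let judge errors raise rather than rely on passthrough.
entries = [json.loads(line)
           for line in pathlib.Path("redteam.jsonl").read_text().splitlines()]

blind_spots = []
for entry in entries:
    result = quality_gate(entry["output"], entry["domain"])  # gate sketch above
    if result["verdict"] != "FAIL":                          # every entry SHOULD fail
        blind_spots.append(entry["id"])

print(f"caught {len(entries) - len(blind_spots)}/{len(entries)}")
if blind_spots:
    print("rubric blind spots:", ", ".join(blind_spots))
```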
Judge Agreement on Holdout Verdicts
The critical question: does the AI judge agree with Tony's real verdicts? We measure this on a holdout set — entries the rubrics never saw during creation.
| Domain | Entries | Agreed | False Pass | False Fail | Agreement |
|---|---|---|---|---|---|
| Behavioral | 2 | 2 | 0 | 0 | 100% |
| Effort-Value | 1 | 1 | 0 | 0 | 100% |
| Process | 1 | 1 | 0 | 0 | 100% |
| Product-Scope | 3 | 3 | 0 | 0 | 100% |
| Sales-BD | 1 | 1 | 0 | 0 | 100% |
| Overall | 8 | 8 | 0 | 0 | 100% |
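The table's columns reduce to a simple confusion count. A minimal sketch follows (the verdict pairs are illustrative, not corpus data); note that a false pass is the costlier error, since it means the gate would let through something Tony would flag:

```python
from collections import Counter

def agreement_stats(pairs):
    """pairs: (tony_verdict, judge_verdict) tuples, each 'PASS' or 'FAIL'.
    False pass: judge says PASS where Tony said FAIL (the gate would leak).
    False fail: judge says FAIL where Tony said PASS (needless revision)."""
    c = Counter()
    for tony, judge in pairs:
        if tony == judge:
            c["agreed"] += 1
        elif judge == "PASS":
            c["false_pass"] += 1
        else:
            c["false_fail"] += 1
    return {**c, "agreement": c["agreed"] / len(pairs) if pairs else 0.0}

# Illustrative pairs, not corpus data:
print(agreement_stats([("PASS", "PASS"), ("FAIL", "FAIL"), ("FAIL", "PASS")]))
# -> {'agreed': 2, 'false_pass': 1, 'agreement': 0.666...}
```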
Honest Caveat
100% agreement on 8 entries is a good start, but the sample is small and may skew toward easier cases. As the corpus grows and we test harder edge cases, we expect agreement to settle around 90%+. That's the real target.
Live Quality Gates
The quality gate runs synchronously before these outputs reach Slack channels. If it fails, the output is revised before delivery (a sketch of the revise-and-regate loop follows the table).
| Delivery Point | Domain | How It's Triggered | Status |
|---|---|---|---|
| BD Daily (cron) | sales-bd | ~/bin/bd-daily.sh | LIVE |
| BD Daily Alert (SOP) | sales-bd | orgloop/sops/bd-daily-alert.md | LIVE |
| BD Weekly Recap (SOP) | sales-bd | orgloop/sops/bd-weekly-recap.md | LIVE |
| Product Judgment Gate (SOP) | product-scope | orgloop/sops/product-judgment-gate.md | LIVE |
| Sprint Kickoff (SOP) | product-scope | orgloop/sops/aipmo-approval.md | LIVE |
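A sketch of that loop at a delivery point, reusing the hypothetical `quality_gate` above; the `revise` callable and the retry limit are assumptions:

```python
from typing import Callable

def deliver_with_gate(
    draft: str,
    domain: str,
    revise: Callable[[str, str], str],  # (draft, judge_feedback) -> revised draft
    max_revisions: int = 2,
) -> str:
    """Synchronous gate at a delivery point. FAIL triggers a revision pass;
    bounded retries avoid the SR-008 anti-pattern of looping without end."""
    for attempt in range(max_revisions + 1):
        result = quality_gate(draft, domain)  # from the gate sketch above
        if result["verdict"] == "PASS":
            return draft
        if attempt < max_revisions:
            draft = revise(draft, result["detail"])  # re-prompt with violated criterion
    return draft  # ships regardless; persistent FAILs surface in shadow review
```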
🔄 Expand Gate Coverage
Wire quality gate into more delivery points: escalation notifications, deployment reports, client-facing status updates. Every external-facing output should pass the gate.
📈 Calibration Drift Monitoring
Track PASS/FAIL rates over time. If pass rate drops below 90%, rubrics may need recalibration. Build a dashboard that surfaces trends automatically.
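A minimal sketch of the trend computation, assuming shadow-review results are logged one JSON object per line (the file name and schema are assumptions):

```python
import json
import pathlib
from collections import defaultdict

# Assumption: results logged one JSON object per line, e.g.
# {"date": "2025-06-01", "domain": "sales-bd", "verdict": "PASS"}.
by_day = defaultdict(lambda: [0, 0])  # date -> [passes, total]
for line in pathlib.Path("shadow-results.jsonl").read_text().splitlines():
    rec = json.loads(line)
    by_day[rec["date"]][1] += 1
    by_day[rec["date"]][0] += rec["verdict"] == "PASS"

for date in sorted(by_day):
    passes, total = by_day[date]
    rate = passes / total
    flag = "  <-- recalibrate?" if rate < 0.90 else ""
    print(f"{date}  {passes}/{total}  {rate:.0%}{flag}")
```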
🧑‍🤝‍🧑 Multi-SME Expansion
Currently calibrated to Tony's judgment only. Next: mine Charlie's engineering verdicts and Victor's process/delivery verdicts into the corpus. Each SME's patterns become new rubric dimensions.
🔗 CI Integration
Wire evals into GitHub Actions. Any PR that touches prompts, SOPs, or agent config automatically runs the eval suite. Regressions caught before merge.
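A hypothetical CI entrypoint, reusing `agreement_stats` from the holdout sketch; `load_holdout_pairs` and the 90% floor are assumptions, not existing tooling:

```python
import sys

FLOOR = 0.90  # matches the stated long-run agreement target

# load_holdout_pairs is hypothetical: it would return (tony, judge) verdict
# tuples for entries the rubrics never saw, as in the holdout sketch above.
stats = agreement_stats(load_holdout_pairs())
if stats["agreement"] < FLOOR:
    # sys.exit with a string prints to stderr and exits nonzero, failing the build
    sys.exit(f"eval regression: agreement {stats['agreement']:.0%} < {FLOOR:.0%}")
print("evals ok")
```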
📊 Gate Analytics Dashboard
Surface gate-failures.jsonl data in a live dashboard. Track: failure rate by domain, most common failure patterns, revision success rate, time-to-pass trends.
🎯 Corpus Growth
86 entries is a strong start. Target: 200+ entries across all domains. More data = more precise rubrics = better judgment calibration. Every Tony correction becomes a new corpus entry automatically.
⚡ The Key Insight
WWTD (What Would Tony Do) is the law. Evals are the court system.
Having principles documented isn't enough — you need a system that consistently interprets and enforces them. WWTD tells Warren what good judgment looks like. The eval system proves whether Warren's actual outputs match that judgment.
The rubrics aren't generic "is it good?" prompts. They're Tony's specific criteria — extracted verbatim from his real production verdicts, with references to exact corpus entries. When the AI judge says PASS, it's saying "Tony would approve this." When it says FAIL, it's saying "Tony would flag this" — and it tells you exactly which criterion was violated.
Proposed Upgrade: Daily Self-Improving Loop
Upgrade the current weekly shadow review into a daily automated loop that continuously improves Warren's judgment calibration. Every day the system gets smarter.
🎯 What Changes
| Aspect | Today (Weekly) | Proposed (Daily) |
|---|---|---|
| Shadow Review | Saturdays 3AM — static queue of 11 entries | Every day 3AM — auto-collected from previous 24h |
| Output Collection | Manual — requires running shadow-collect.py with an export | Automatic — Slack API pulls Warren's outputs from all channels |
| Feedback Loop | Report saved to disk. No automatic action. | Failures feed back: logged, pattern-tracked, rubric refinement triggered |
| Corpus Growth | Manual — human adds entries to corpus | Semi-automatic — every Tony correction becomes a candidate corpus entry |
| Quality Gate | Pre-delivery on 5 SOPs (already live) | Same + trend data from daily reviews informs gate sensitivity |
🔁 The Self-Improving Flywheel
Warren output → daily shadow eval → FAIL logged and pattern-tracked → rubric refinement → corpus candidate → sharper judge → better next output.
⚙️ Technical Implementation
- Auto-collect script: New `shadow-auto-collect.sh` — calls Slack API for each monitored channel, filters Warren's messages, classifies by domain, writes daily queue (a Python sketch follows this list).
- Cron change: `0 3 * * *` (daily) instead of `0 3 * * 6` (Saturdays only).
- Failure feedback: Every FAIL appended to `gate-failures.jsonl` with structured metadata. Weekly rollup flags patterns.
- Corpus auto-ingestion: Tony corrections staged as candidates. Human review required before promotion — no auto-corruption of ground truth.
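A Python sketch of the collection logic (the real script is shell; channel ids, the bot user id, and the channel-to-domain mapping are placeholders):

```python
import json
import os
import time

from slack_sdk import WebClient  # pip install slack_sdk

client = WebClient(token=os.environ["SLACK_BOT_TOKEN"])
WARREN_USER = "U0WARREN"  # placeholder; bot posts may need matching on bot_id instead
CHANNELS = {"C0BDDAILY": "sales-bd", "C0SPRINT": "product-scope"}  # placeholders

queue = []
oldest = str(time.time() - 24 * 3600)  # previous 24h only
for channel_id, domain in CHANNELS.items():
    # Pagination omitted for brevity; a real collector would walk cursors.
    history = client.conversations_history(channel=channel_id, oldest=oldest, limit=200)
    for msg in history["messages"]:
        if msg.get("user") == WARREN_USER:  # keep only Warren's outputs
            queue.append({"domain": domain, "ts": msg["ts"], "output": msg["text"]})

with open("shadow-queue.jsonl", "w") as f:  # daily queue for the 3AM review
    f.writelines(json.dumps(item) + "\n" for item in queue)
```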
💡 Why This Matters
Moving to a daily cadence, with auto-collection and a failure feedback loop, turns the shadow review into a system that gets measurably better every day.
- Week 1: 86 corpus entries, ~10 outputs/day evaluated
- Month 1: ~130 corpus entries, rubrics refined from daily failure patterns
- Month 3: 200+ corpus entries, multi-SME, measurable trend line
This is the AIPMO differentiator. Not "our agent follows instructions" — but "our agent has a provable, continuously-improving quality system calibrated to your SMEs."
✋ Decision Requested
Approve upgrading the eval system from weekly static review to daily auto-collected self-improving engine.
- Scope: Auto-collect + daily cron + failure feedback loop + corpus candidate ingestion
- Timeline: Core loop operational within 1 week. Trend dashboard within 2 weeks.
- Cost: ~$0.10–0.30/day (GLM 5.1 via Together API)
- Risk: Minimal by design. Passthrough means a failed eval call never blocks or alters delivery.