Warren Evals — Agent Quality System
Evidence of the closed-loop evaluation system that ensures Warren's outputs meet Tony-calibrated quality standards. Daily automated reviews, mechanical output capture, self-improvement loop, and MLflow observability — all live in production.
At a glance: 10 runs · 100 entries reviewed · judge agreement 8/8 · rate 14/15 · 6 rubric domains · daily review at 03:30 PT · 3 failures in self-improvement · ~$0.32/run · 86 corpus entries.
Six major additions that transform the system from weekly static review into a daily closed-loop self-improving engine.
🎣 Output Collector Hook NEW
Mechanically captures ALL outbound Warren messages. Classifies each by domain (sales-bd, product-scope, process, behavioral). Auto-feeds the shadow-review-queue. 10% priority sampling for high-signal outputs. Zero manual curation — every output enters the eval pipeline automatically.
📅 Shadow Review → Daily UPGRADED
Was: Saturday 3AM only. Now: daily at 03:30 PT. ~16 entries per run. ~$0.32/run via GLM 5.1 on Together API. ~$10/month total. Failures detected within 24 hours instead of up to 7 days.
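The cost figures above are consistent with each other. A quick back-of-envelope check, assuming one run per day and a 30-day month (the month length is an assumption, not a figure from this page):

```python
# Back-of-envelope check of the daily shadow review cost figures.
COST_PER_RUN_USD = 0.32   # from the text: ~$0.32/run
ENTRIES_PER_RUN = 16      # from the text: ~16 entries per run
RUNS_PER_MONTH = 30       # assumption: one run per day, 30-day month

monthly_cost = COST_PER_RUN_USD * RUNS_PER_MONTH
cost_per_entry = COST_PER_RUN_USD / ENTRIES_PER_RUN

print(f"monthly: ${monthly_cost:.2f}")      # monthly: $9.60
print(f"per entry: ${cost_per_entry:.3f}")  # per entry: $0.020
```

That works out to roughly two cents per reviewed entry, which is where the ~$10/month figure comes from.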
🔬 Intake Quality Gate — 6th Rubric Domain NEW
GLM 5.1 validates distilled Google Drive content against source documents. Criteria: fact accuracy, no fabrication, classification correctness, source attribution, numbers & dates. Ensures the information Warren ingests is faithful to source material before it ever reaches output generation.
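One way the five intake criteria could roll up into a single gate verdict is sketched below. The criterion identifiers, the `IntakeVerdict` type, and the all-must-pass rule are assumptions for illustration; the actual rubric format is not shown on this page.

```python
from dataclasses import dataclass

# The five criteria named in the text; the identifiers themselves are hypothetical.
INTAKE_CRITERIA = (
    "fact_accuracy",
    "no_fabrication",
    "classification_correctness",
    "source_attribution",
    "numbers_and_dates",
)


@dataclass
class IntakeVerdict:
    passed: bool
    violated: list[str]


def aggregate_intake_verdict(criterion_results: dict[str, bool]) -> IntakeVerdict:
    """Combine per-criterion judge results into one verdict.

    Assumption: every criterion must pass, and a criterion the judge
    did not report on counts as a failure.
    """
    violated = [c for c in INTAKE_CRITERIA if not criterion_results.get(c, False)]
    return IntakeVerdict(passed=not violated, violated=violated)
```

Treating a missing criterion as a failure keeps a malformed judge response from silently passing distilled content.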
🔗 Correlation Engine NEW
Cross-source pattern analysis. Correlates intake accuracy with output quality — if bad data gets in, does it produce bad outputs? Detects recurring unresolved action items and topic drift. Daily lightweight analysis + weekly full LLM-powered deep analysis.
🔄 Self-Improvement Loop NEW
Closed-loop: shadow review identifies failures → dashboard shows WHY it failed + fix recommendation → Warren reads failures at session startup → implements fixes → next shadow review verifies the fix worked. Currently: Cycle 1 — 3 active failures, 0 resolved.
📈 MLflow 3.12.0 Integration NEW
All eval scripts instrumented with @mlflow.trace. Experiment: "warren-evals". 4 active traces, 6 rubrics in Prompt Registry. Full cost tracking, latency monitoring, and prompt versioning. Every evaluation run is observable and reproducible.
The Problem
Warren produces 100+ outputs per day: BD dailies, sprint plans, product analyses, PR reviews, client deliverables. Tony can't review everything. We needed an automated quality system that catches judgment failures: not just "did it run?" but "did it make the right call?" It also needed to improve automatically rather than wait for humans to notice problems.
The Solution — A Closed-Loop Engine
A multi-level evaluation system built on Tony's actual judgment patterns. A cross-model AI judge (GLM 5.1) scores Warren's real outputs against rubrics extracted from 86 real Tony verdicts. The system runs daily with automatic output collection, failure analysis, and a self-improvement loop. Every failure produces a fix recommendation that Warren implements and the next review cycle verifies. MLflow provides full observability.
Key Design Decisions
- Cross-model judging: GLM 5.1 (open-weights) judges Warren's Claude outputs — not self-eval. The agent never grades its own homework.
- Tony's real verdicts = ground truth: 86 entries mined from actual production evaluations, not synthetic test cases.
- Mechanical output capture: The output collector hook captures ALL messages — no manual curation, no sampling bias.
- Closed-loop self-improvement: Failures produce fix recommendations. Warren implements them. Next review verifies. No human intervention required for the loop to work.
- Passthrough on error: API failures, timeouts, or missing keys never block delivery. Safety net, not a blocker.
- Domain-matched rubrics (6 domains): Each output is evaluated against the correct domain rubric, not a generic "is it good?" prompt.
- Full observability via MLflow: Every eval run is traced, costed, and version-tracked.
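The passthrough-on-error decision can be sketched as a thin wrapper around the judge call. The `judge` callable and the function name are hypothetical; this is a minimal illustration of the behavior described above, not the actual quality-gate.py code.

```python
import logging
from typing import Callable

logger = logging.getLogger("quality_gate")


def gate_with_passthrough(output: str, domain: str,
                          judge: Callable[[str, str], bool]) -> bool:
    """Return True if the output may be delivered.

    `judge` is a hypothetical callable that returns True for PASS and
    False for FAIL, and may raise on API failures, timeouts, or missing keys.
    """
    try:
        return judge(output, domain)
    except Exception as exc:  # broad on purpose: the gate must never block delivery
        logger.warning("quality gate error, passing through: %s", exc)
        return True
```

Only an explicit FAIL from the judge holds an output back; any error in the gate itself results in delivery.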
The system doesn't just detect failures; it is designed to fix them. Every shadow review failure triggers a closed-loop process intended to improve Warren's judgment, and the next review verifies whether each fix worked.
Current Active Failures (Cycle 1)
These 3 failures were detected by the shadow review system and are being actively addressed through the self-improvement loop:
| # | Domain | Pattern | Status |
|---|---|---|---|
| 1 | Process | Excessive CI loop cycles without human escalation | Active |
| 2 | Behavioral | Completion claim without explicit verification evidence | Active |
| 3 | Sales-BD | Strategy depth insufficient for enterprise contexts | Active |
Each rubric encodes Tony's specific judgment patterns from real production evaluations. The rubric content IS Tony's criteria — extracted verbatim from his actual verdicts, structured into machine-parsable YAML. All 6 rubrics registered in MLflow Prompt Registry.
🎯 Sales & BD
📦 Product Scope
⚖️ Effort-Value
⚙️ Process
🧠 Behavioral
🔬 Intake Accuracy
Mechanical capture of every outbound Warren message. No manual curation, no selection bias.
How It Works
- Capture scope: ALL outbound Warren messages across all monitored Slack channels
- Domain classification: Each message auto-classified into sales-bd, product-scope, process, or behavioral
- Queue feeding: Auto-appended to shadow-review-queue.jsonl — ready for next daily review
- Priority sampling: 10% of outputs flagged as priority based on delivery point (client-facing, BD, sprint deliverables)
- Zero manual effort: Collector runs mechanically; no human needs to curate the review queue
Before: shadow-collect.py with Slack exports. Coverage gaps when no one remembered to run it.
After: 100% output coverage. Every Warren message is eval-eligible, so no output is skipped because someone forgot to run the collector.
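The capture steps above can be sketched as a classify-and-append function. The keyword map, the fallback domain, the set of priority delivery points, and the entry fields are all assumptions for illustration; only the queue file name `shadow-review-queue.jsonl` and the four domain labels come from the text.

```python
import json
from datetime import datetime, timezone
from pathlib import Path

# Hypothetical keyword map; the real classifier is not described in the text.
DOMAIN_KEYWORDS = {
    "sales-bd": ("prospect", "deal", "pipeline"),
    "product-scope": ("scope", "feature", "sprint"),
    "process": ("ci", "pr", "deploy"),
}
# Assumed delivery points that mark an entry as priority.
PRIORITY_DELIVERY_POINTS = {"client-facing", "bd", "sprint-deliverable"}


def classify_domain(text: str) -> str:
    """Pick the first domain whose keywords appear in the message."""
    words = set(text.lower().split())
    for domain, keywords in DOMAIN_KEYWORDS.items():
        if words & set(keywords):
            return domain
    return "behavioral"  # assumed fallback domain


def collect(text: str, delivery_point: str, queue_path: Path) -> dict:
    """Build a queue entry and append it as one JSON line."""
    entry = {
        "captured_at": datetime.now(timezone.utc).isoformat(),
        "domain": classify_domain(text),
        "priority": delivery_point in PRIORITY_DELIVERY_POINTS,
        "text": text,
    }
    with queue_path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(entry) + "\n")
    return entry
```

In this sketch the priority flag follows purely from the delivery point, so the 10% share quoted above would be an outcome of the traffic mix rather than something the code enforces.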
Cross-source pattern analysis that connects intake quality to output quality.
What It Detects
- Intake → Output correlation: When bad data gets ingested (intake-accuracy failures), does it produce bad outputs downstream? Tracks causal chains.
- Recurring unresolved action items: Action items that appear in multiple reviews without resolution — signals stuck work that needs human attention.
- Topic drift: Detects when Warren's outputs drift away from the topics and domains that matter most to the team.
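Detecting recurring unresolved action items can be as simple as counting how many reviews mention each open item. The review schema (`open_items`, `resolved_items`) and the two-review threshold below are hypothetical; the sketch only illustrates the idea.

```python
from collections import Counter


def recurring_action_items(reviews: list[dict], min_reviews: int = 2) -> list[str]:
    """Return unresolved action items that appear in at least `min_reviews` reviews.

    Each review is assumed to look like
    {"open_items": [...], "resolved_items": [...]}; this schema is hypothetical.
    """
    seen = Counter()
    resolved = set()
    for review in reviews:
        # Use a set so an item repeated inside one review counts once.
        seen.update(set(review.get("open_items", [])))
        resolved.update(review.get("resolved_items", []))
    return sorted(
        item for item, count in seen.items()
        if count >= min_reviews and item not in resolved
    )
```

Items returned by this function are the ones that would be surfaced for human attention.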
Two-Speed Analysis
| Mode | Frequency | Method | Cost |
|---|---|---|---|
| Lightweight | Daily | Pattern matching, statistical correlation | ~$0 |
| Deep Analysis | Weekly | Full LLM analysis of cross-domain patterns | ~$0.15 |
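
For the lightweight daily mode, one plausible form of "statistical correlation" is a Pearson coefficient between daily intake failure rates and daily output failure rates. This is an assumed technique, not a confirmed description of the Correlation Engine, and the function name is hypothetical.

```python
from math import sqrt


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation of two equal-length series.

    Returns 0.0 when either series has no variation, since the
    coefficient is undefined in that case.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)
```

A high coefficient is only a signal that bad intake and bad outputs move together; confirming an actual causal chain would still need the weekly deep analysis or a human look.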
What's Instrumented
- All eval scripts: @mlflow.trace decorator on quality-gate.py, shadow-review.py, measure-agreement.py, adversarial-test.py
- Experiment: "warren-evals", with 4 active traces tracking every evaluation run
- Prompt Registry: 6 rubrics registered as versioned prompts — changes tracked, rollback possible
- Cost tracking: Per-run and cumulative token usage and API costs
- Latency monitoring: Evaluation latency per rubric domain and overall
After: every run is a traced experiment. Compare this week's shadow review to last week's. See if rubric changes improved or degraded pass rates. Track exactly what each eval costs.
| Domain | Total | Pass | Fail | Score |
|---|---|---|---|---|
| Sales-BD | 28 | 27 | 1 | 96% |
| Product-Scope | 22 | 22 | 0 | 100% |
| Effort-Value | 18 | 18 | 0 | 100% |
| Process | 16 | 15 | 1 | 94% |
| Behavioral | 16 | 15 | 1 | 94% |
| Overall | 100 | 97 | 3 | 97% |
The 3 Failures — Now in Self-Improvement Loop
Each failure was analyzed, root-caused, and fed into the self-improvement loop with a fix recommendation:
- Process failure: PR #371 — excessive CI loop cycles without escalating to human. Fix: escalate after 3 cycles, not 5.
- Behavioral failure: Completion claim without explicit verification steps documented. Fix: every "done" must include what was checked and how.
- Sales-BD failure: Enterprise strategy lacked sufficient depth for the account tier. Fix: tier-aware strategy depth requirements.
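The first two fixes are mechanical enough to sketch as guards. The function names and the claim fields are hypothetical, and the reading of "escalate after 3 cycles" as "escalate once 3 cycles have failed" is an assumption.

```python
MAX_CI_CYCLES = 3  # from the fix recommendation: escalate after 3 cycles, not 5


def next_action(failed_cycles: int) -> str:
    """Decide what to do after a CI failure: "retry" or "escalate"."""
    return "escalate" if failed_cycles >= MAX_CI_CYCLES else "retry"


def completion_claim_is_valid(claim: dict) -> bool:
    """A "done" claim must state what was checked and how (field names assumed)."""
    return bool(claim.get("what_was_checked")) and bool(claim.get("how_verified"))
```

The third fix, tier-aware strategy depth, depends on judgment criteria that are not spelled out here, so it is left out of the sketch.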
15 deliberately crafted bad outputs that should ALL fail. Each embodies a specific anti-pattern Tony has flagged in production. If the rubric misses any, it has a blind spot.
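The adversarial check reduces to running the judge over the known-bad outputs and listing any it passes. The case schema and the `judge` callable below are hypothetical stand-ins for whatever adversarial-test.py actually uses.

```python
from typing import Callable


def find_blind_spots(cases: list[dict],
                     judge: Callable[[str, str], bool]) -> list[str]:
    """Return ids of known-bad outputs that the judge wrongly passed.

    Each case is assumed to look like {"id": ..., "domain": ..., "output": ...};
    `judge` returns True for PASS and False for FAIL.
    """
    return [c["id"] for c in cases if judge(c["output"], c["domain"])]


def detection_rate(cases: list[dict],
                   judge: Callable[[str, str], bool]) -> float:
    """Fraction of known-bad outputs the judge correctly failed."""
    if not cases:
        raise ValueError("no adversarial cases supplied")
    return 1 - len(find_blind_spots(cases, judge)) / len(cases)
```

Any id returned by `find_blind_spots` points at an anti-pattern the rubric does not yet catch.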
The critical question: does the AI judge agree with Tony's real verdicts? We measure this on a holdout set — entries the rubrics never saw during creation.
| Domain | Entries | Agreed | False Pass | False Fail | Agreement |
|---|---|---|---|---|---|
| Behavioral | 2 | 2 | 0 | 0 | 100% |
| Effort-Value | 1 | 1 | 0 | 0 | 100% |
| Process | 1 | 1 | 0 | 0 | 100% |
| Product-Scope | 3 | 3 | 0 | 0 | 100% |
| Sales-BD | 1 | 1 | 0 | 0 | 100% |
| Overall | 8 | 8 | 0 | 0 | 100% |
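
The columns in the table above can be derived from pairs of verdicts. A minimal sketch, assuming each holdout entry carries Tony's verdict and the judge's verdict as "PASS" or "FAIL" (the function name and return shape are hypothetical):

```python
from collections import Counter
from typing import Iterable


def tally_agreement(pairs: Iterable[tuple[str, str]]) -> dict:
    """Tally (tony_verdict, judge_verdict) pairs, each "PASS" or "FAIL"."""
    counts = Counter()
    for tony, judge in pairs:
        if tony == judge:
            counts["agreed"] += 1
        elif judge == "PASS":
            counts["false_pass"] += 1  # judge passed what Tony failed
        else:
            counts["false_fail"] += 1  # judge failed what Tony passed
    total = sum(counts.values())
    return {
        "entries": total,
        "agreed": counts["agreed"],
        "false_pass": counts["false_pass"],
        "false_fail": counts["false_fail"],
        "agreement": counts["agreed"] / total if total else 0.0,
    }
```

False passes are the more costly error for a quality gate, which is why they are counted separately from false fails.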
Honest Caveat
100% agreement on 8 entries is a good start, but the sample is small and may skew toward easier cases. As the corpus grows and we test harder edge cases, we expect agreement to settle around 90%+. That's the real target.
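To put a number on how little 8 entries can prove, a Wilson score interval is one standard choice for a proportion from a small sample. The choice of interval and the 95% level (z = 1.96) are assumptions of this sketch, not part of the eval system.

```python
from math import sqrt


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


low, high = wilson_interval(8, 8)
print(f"{low:.2f} to {high:.2f}")  # 0.68 to 1.00
```

So 8 out of 8 is statistically compatible with a true agreement rate as low as roughly 68%, which is why the 90%+ target needs a larger holdout set before it can be claimed.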
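The quality gate runs synchronously before these outputs reach Slack channels. If an output fails the gate, it is revised before delivery. If the gate itself errors (API failure, timeout, missing key), the output passes through unblocked.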
| Delivery Point | Domain | How It's Triggered | Status |
|---|---|---|---|
| BD Daily (cron) | sales-bd | ~/bin/bd-daily.sh | LIVE |
| BD Daily Alert (SOP) | sales-bd | orgloop/sops/bd-daily-alert.md | LIVE |
| BD Weekly Recap (SOP) | sales-bd | orgloop/sops/bd-weekly-recap.md | LIVE |
| Product Judgment Gate (SOP) | product-scope | orgloop/sops/product-judgment-gate.md | LIVE |
| Sprint Kickoff (SOP) | product-scope | orgloop/sops/aipmo-approval.md | LIVE |
| Drive Content Intake | intake-accuracy | Intake quality gate (6th rubric) | NEW |
🔄 Self-Improvement Loop Maturation
Cycle 1 has 3 active failures, 0 resolved. Target: resolve all 3 within the next 2 weeks and validate via shadow review. Track resolution velocity as a key system metric.
📈 Calibration Drift Monitoring Dashboard
Surface MLflow experiment data in a live dashboard. Track PASS/FAIL rates over time by domain. Alert if any domain drops below 90%. Visualize self-improvement loop resolution trends.
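The 90% alert rule can be sketched as a simple threshold check over per-domain pass counts. The function name and the input shape are hypothetical; the counts are the shadow review results reported earlier on this page.

```python
ALERT_THRESHOLD = 0.90  # from the text: alert if any domain drops below 90%


def domains_below_threshold(results: dict[str, tuple[int, int]],
                            threshold: float = ALERT_THRESHOLD) -> list[str]:
    """results maps domain -> (passed, total); return domains under the threshold."""
    alerts = []
    for domain, (passed, total) in results.items():
        if total and passed / total < threshold:
            alerts.append(domain)
    return sorted(alerts)


# Counts taken from the shadow review results table.
current = {
    "sales-bd": (27, 28),
    "product-scope": (22, 22),
    "effort-value": (18, 18),
    "process": (15, 16),
    "behavioral": (15, 16),
}
print(domains_below_threshold(current))  # []
```

With the current numbers no domain would alert; with 16 entries in a domain, a second failure (14/16) would push it below the threshold.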
🧑‍🤝‍🧑 Multi-SME Expansion
Currently calibrated to Tony's judgment only. Next: mine Charlie's engineering verdicts and Victor's process/delivery verdicts into the corpus. Each SME's patterns become new rubric dimensions.
🔗 CI Integration
Wire evals into GitHub Actions. Any PR that touches prompts, SOPs, or agent config automatically runs the eval suite. Regressions caught before merge.
🎯 Corpus Growth → 200+
86 entries is a strong start. Target: 200+ entries across all domains. The self-improvement loop and daily reviews accelerate corpus growth by surfacing new judgment patterns. Every Tony correction becomes a new corpus entry automatically.
⚡ The Key Insight
WWTD (What Would Tony Do) is the law. Evals are the court system. The self-improvement loop is the appeals process.
Having principles documented isn't enough — you need a system that consistently interprets and enforces them, and gets better at it every day. WWTD tells Warren what good judgment looks like. The eval system proves whether Warren's actual outputs match that judgment. And when they don't, the self-improvement loop ensures the same failure doesn't happen twice.
The rubrics aren't generic "is it good?" prompts. They're Tony's specific criteria — extracted verbatim from his real production verdicts, with references to exact corpus entries. When the AI judge says PASS, it's saying "Tony would approve this." When it says FAIL, it's saying "Tony would flag this" — and it tells you exactly which criterion was violated, why, and how to fix it.