🟢 System Live — Closed-Loop Self-Improving

Warren Evals — Agent Quality System

Evidence of the closed-loop evaluation system that ensures Warren's outputs meet Tony-calibrated quality standards. Daily automated reviews, mechanical output capture, self-improvement loop, and MLflow observability — all live in production.

📊 Key Metrics (Live Data — May 19, 2026)
  • Shadow Review Pass Rate: 97% (10 runs · 100 entries)
  • Judge–Human Agreement: 100% (8/8)
  • Adversarial Detection Rate: 100% excluding errors (14 of 15 caught; 1 returned UNCLEAR because of a judge error)
  • Calibrated Rubric Domains: 6
  • Shadow Review Frequency: Daily (03:30 PT)
  • Active Failures in Self-Improvement: 3
  • Monthly Cost: ~$10 (~$0.32/run)
  • Tony Judgment Corpus Entries: 86

🆕 What Changed — May 19, 2026

Six major additions that transform the system from weekly static review into a daily closed-loop self-improving engine.

🎣 Output Collector Hook NEW

Mechanically captures ALL outbound Warren messages. Classifies each by domain (sales-bd, product-scope, process, behavioral). Auto-feeds the shadow-review-queue. 10% priority sampling for high-signal outputs. Zero manual curation — every output enters the eval pipeline automatically.

📅 Shadow Review → Daily UPGRADED

Was: Saturday 3AM only. Now: daily at 03:30 PT. ~16 entries per run. ~$0.32/run via GLM 5.1 on Together API. ~$10/month total. Failures detected within 24 hours instead of up to 7 days.

🔬 Intake Quality Gate — 6th Rubric Domain NEW

GLM 5.1 validates distilled Google Drive content against source documents. Criteria: fact accuracy, no fabrication, classification correctness, source attribution, numbers & dates. Ensures the information Warren ingests is faithful to source material before it ever reaches output generation.

🔗 Correlation Engine NEW

Cross-source pattern analysis. Correlates intake accuracy with output quality — if bad data gets in, does it produce bad outputs? Detects recurring unresolved action items and topic drift. Daily lightweight analysis + weekly full LLM-powered deep analysis.

🔄 Self-Improvement Loop NEW

Closed-loop: shadow review identifies failures → dashboard shows WHY it failed + fix recommendation → Warren reads failures at session startup → implements fixes → next shadow review verifies the fix worked. Currently: Cycle 1 — 3 active failures, 0 resolved.

📈 MLflow 3.12.0 Integration NEW

All eval scripts instrumented with @mlflow.trace. Experiment: "warren-evals". 4 active traces, 6 rubrics in Prompt Registry. Full cost tracking, latency monitoring, and prompt versioning. Every evaluation run is observable and reproducible.

🎯 What This System Does

The Problem

Warren produces 100+ outputs per day: BD dailies, sprint plans, product analyses, PR reviews, and client deliverables. Tony can't review everything. We needed an automated quality system that catches judgment failures: not just "did it run?" but "did it make the right call?" It also needed to improve automatically rather than wait for humans to notice problems.

The Solution — A Closed-Loop Engine

A multi-level evaluation system built on Tony's actual judgment patterns. A cross-model AI judge (GLM 5.1) scores Warren's real outputs against rubrics extracted from 86 real Tony verdicts. The system runs daily with automatic output collection, failure analysis, and a self-improvement loop. Every failure produces a fix recommendation that Warren implements and the next review cycle verifies. MLflow provides full observability.

🏗️ Architecture — Full Closed-Loop System
Warren produces output for delivery
                  │
                  ▼
┌─────────────────────────────────────────────┐
│ quality-gate.py (Pre-Delivery Gate)         │
│ Judge: GLM 5.1 (cross-model, open-weights)  │
│ Matches output → domain rubric (6 domains)  │
│ --passthrough (never blocks on error)       │
└─────────────────┬───────────────────────────┘
                  │
            ┌─────┴─────┐
            │           │
         ✅ PASS     ❌ FAIL
         Deliver     Revise + re-gate
         to Slack    (max 2 attempts)
                  │
                  ▼
┌─────────────────────────────────────────────┐
│ Output Collector Hook (Mechanical)          │
│ Captures ALL outbound messages              │
│ Classifies by domain · 10% priority sample  │
│ Auto-feeds shadow-review-queue              │
└─────────────────┬───────────────────────────┘
                  │
                  ▼
Daily at 03:30 PT · Shadow Review
┌─────────────────────────────────────────────┐
│ shadow-review.py (Daily Batch Evaluator)    │
│ ~16 entries/run · ~$0.32/run                │
│ Evaluates each against domain rubric        │
│ Produces report card (JSON + Markdown)      │
└─────────────────┬───────────────────────────┘
                  │
            ┌─────┴─────┐
            │           │
         ✅ PASS     ❌ FAIL
         Trend          │
         data           ▼
        ┌─────────────────────────────────┐
        │ Self-Improvement Loop           │
        │ WHY it failed + fix rec         │
        │ Warren reads at session start   │
        │ Implements fix → re-verified    │
        │ Cycle 1: 3 active, 0 resolved   │
        └─────────────────────────────────┘

Intake Path:
┌─────────────────────────────────────────────┐
│ Intake Quality Gate (6th Rubric)            │
│ Validates Drive content against source      │
│ Fact accuracy · No fabrication · Attribution│
└─────────────────┬───────────────────────────┘
                  │
                  ▼
┌─────────────────────────────────────────────┐
│ Correlation Engine                          │
│ Intake accuracy ↔ output quality            │
│ Recurring action items · Topic drift        │
│ Daily lightweight + weekly full LLM         │
└─────────────────────────────────────────────┘

Observability:
┌─────────────────────────────────────────────┐
│ MLflow 3.12.0                               │
│ @mlflow.trace on all eval scripts           │
│ Experiment: "warren-evals" · 4 traces       │
│ 6 rubrics in Prompt Registry                │
│ Cost tracking · Latency · Prompt versions   │
└─────────────────────────────────────────────┘

Key Design Decisions

  • Cross-model judging: GLM 5.1 (open-weights) judges Warren's Claude outputs — not self-eval. The agent never grades its own homework.
  • Tony's real verdicts = ground truth: 86 entries mined from actual production evaluations, not synthetic test cases.
  • Mechanical output capture: The output collector hook captures ALL messages — no manual curation, no sampling bias.
  • Closed-loop self-improvement: Failures produce fix recommendations. Warren implements them. Next review verifies. No human intervention required for the loop to work.
  • Passthrough on error: API failures, timeouts, or missing keys never block delivery. Safety net, not a blocker.
  • Domain-matched rubrics (6 domains): Each output is evaluated against the correct domain rubric, not a generic "is it good?" prompt.
  • Full observability via MLflow: Every eval run is traced, costed, and version-tracked.
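
The gate behaviour in the design decisions above (domain-matched judging, passthrough on error, at most two revision attempts) can be sketched roughly as follows. This is an illustration under stated assumptions, not the contents of quality-gate.py: `run_gate`, `GateResult`, and the `judge` and `revise` callables are hypothetical names, and what happens to an output that still fails after the last attempt is not stated in this document, so the sketch simply reports FAIL to the caller.

```python
from dataclasses import dataclass
from typing import Callable

MAX_ATTEMPTS = 2  # "Revise + re-gate (max 2 attempts)" in the architecture diagram


@dataclass
class GateResult:
    text: str      # the text as it stood when the gate finished
    verdict: str   # "PASS", "FAIL", or "PASSTHROUGH"
    attempts: int  # number of revisions performed


def run_gate(
    text: str,
    domain: str,
    judge: Callable[[str, str], bool],   # hypothetical: True means the rubric passed
    revise: Callable[[str, str], str],   # hypothetical: returns a revised draft
) -> GateResult:
    """Judge the text against its domain rubric, revising up to MAX_ATTEMPTS times."""
    attempts = 0
    while True:
        try:
            passed = judge(text, domain)
        except Exception:
            # Passthrough on error: API failures or timeouts never block delivery.
            return GateResult(text, "PASSTHROUGH", attempts)
        if passed:
            return GateResult(text, "PASS", attempts)
        if attempts >= MAX_ATTEMPTS:
            # Assumption: the caller decides what to do with a final FAIL.
            return GateResult(text, "FAIL", attempts)
        text = revise(text, domain)
        attempts += 1
```

A caller would deliver to Slack on PASS or PASSTHROUGH and log the remaining cases, for example to gate-failures.jsonl.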
🔄 Self-Improvement Loop — How Warren Gets Better Daily

The system doesn't just detect failures — it fixes them. Every shadow review failure triggers a closed-loop process that automatically improves Warren's judgment.

1. Output Collector captures all messages
   Mechanical hook on every outbound Warren message. Classified by domain. Auto-fed to shadow-review-queue. 10% priority sampling.
2. Daily Shadow Review evaluates ~16 outputs
   GLM 5.1 judges each output against its domain rubric. PASS/FAIL with reasoning. Runs at 03:30 PT daily. ~$0.32/run.
3. Failures → Root cause + fix recommendation
   Each failure includes: what went wrong, which rubric criterion was violated, and a concrete recommendation for how to fix the pattern.
4. Warren reads failures at session startup
   Active failures are loaded into Warren's session context. Warren implements the recommended fixes in subsequent outputs.
5. Next shadow review verifies the fix
   If the pattern doesn't recur, the failure is marked resolved. If it recurs, the recommendation is refined and the cycle continues.
↩ Cycle repeats daily. Currently: Cycle 1 · 3 active failures · 0 resolved
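
Steps 4 and 5 of the loop can be sketched as a small state update. This is a minimal illustration, not the production code: the `Failure` record, `session_context`, and `verify_cycle` are hypothetical names, and matching a recurrence by exact pattern string is an assumption made to keep the example short. Refining a recommendation when a pattern recurs is left out.

```python
from dataclasses import dataclass


@dataclass
class Failure:
    domain: str
    pattern: str
    recommendation: str
    status: str = "active"   # "active" or "resolved"
    cycles_seen: int = 1


def session_context(failures: list[Failure]) -> str:
    """Step 4: render active failures as text loaded at session startup."""
    return "\n".join(
        f"[{f.domain}] {f.pattern} -> fix: {f.recommendation}"
        for f in failures
        if f.status == "active"
    )


def verify_cycle(failures: list[Failure], recurred: set[str]) -> list[Failure]:
    """Step 5: resolve failures whose pattern did not recur in the latest review."""
    for f in failures:
        if f.status != "active":
            continue
        if f.pattern in recurred:
            f.cycles_seen += 1   # still active; the recommendation would be refined here
        else:
            f.status = "resolved"
    return failures


# The three Cycle 1 failures and fixes as described in this document.
cycle_1 = [
    Failure("process", "Excessive CI loop cycles without human escalation",
            "escalate after 3 cycles, not 5"),
    Failure("behavioral", "Completion claim without explicit verification evidence",
            'every "done" must include what was checked and how'),
    Failure("sales-bd", "Strategy depth insufficient for enterprise contexts",
            "tier-aware strategy depth requirements"),
]
```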

Current Active Failures (Cycle 1)

These 3 failures were detected by the shadow review system and are being actively addressed through the self-improvement loop:

# | Domain     | Pattern                                                  | Status
1 | Process    | Excessive CI loop cycles without human escalation        | Active
2 | Behavioral | Completion claim without explicit verification evidence  | Active
3 | Sales-BD   | Strategy depth insufficient for enterprise contexts      | Active

📏 The 6 Rubric Domains

Each rubric encodes Tony's specific judgment patterns from real production evaluations. The rubric content IS Tony's criteria — extracted verbatim from his actual verdicts, structured into machine-parsable YAML. All 6 rubrics registered in MLflow Prompt Registry.

🎯 Sales & BD

Customer intent alignment, pre-writing gate, show-don't-tell, confidence calibration, strategic depth. Catches: a deliverable that serves the stated questions rather than the customer's intent, a document delivered when an experience was needed, confidence inversion.
Corpus: TJ-038, TJ-039, TJ-040, TJ-064, TJ-065, TJ-066

📦 Product Scope

Scope fidelity, effort-value ratio, value priority, context-aware judgment. Catches: multi-tenant when asked for single-customer, infrastructure before content, building for imagined scale.
Corpus: TJ-014–TJ-023, TJ-027, TJ-036

⚖️ Effort-Value

Simplest path to outcome, no phantom problems, core value first, effort proportional to stakes. Catches: disproportionate effort, derived config when explicit available, polish before validation.
Corpus: TJ-015, TJ-016, TJ-017

⚙️ Process

Engineering process quality, incident handling, CI discipline, technical correctness.
Corpus: TJ-010–TJ-013

🧠 Behavioral

No permission loops, no self-talk, use available resources, follow explicit instructions exactly, validate before claiming completeness.
Corpus: TJ-003, TJ-029, TJ-044, TJ-067, TJ-068

🔬 Intake Accuracy

Validates distilled Drive content against source documents. Criteria: fact accuracy, no fabrication, classification correctness, source attribution, numbers & dates fidelity. Ensures faithful information ingestion before output generation.
GLM 5.1 validates source ↔ distilled content
🎣 Output Collector Hook

Mechanical capture of every outbound Warren message. No manual curation, no selection bias.

How It Works

  • Capture scope: ALL outbound Warren messages across all monitored Slack channels
  • Domain classification: Each message auto-classified into sales-bd, product-scope, process, or behavioral
  • Queue feeding: Auto-appended to shadow-review-queue.jsonl — ready for next daily review
  • Priority sampling: 10% of outputs flagged as priority based on delivery point (client-facing, BD, sprint deliverables)
  • Zero manual effort: Collector runs mechanically; no human needs to curate the review queue
Impact
Before: shadow-collect.py had to be run manually against Slack exports, which left coverage gaps whenever no one remembered to run it.
After: 100% output coverage. Every Warren message is eval-eligible. The system can never miss an output.
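
The capture, classify, and queue steps described above might look roughly like this. It is a sketch under assumptions, not output-collector.py itself: the keyword lists, the record fields, the delivery-point labels, and the fallback domain are all hypothetical. Priority is derived from the delivery point, as the section describes; the 10% figure would then be an observed share rather than a random sample.

```python
import json
from datetime import datetime, timezone
from typing import TextIO

# Hypothetical keyword lists; the real classifier is not described in this document.
DOMAIN_KEYWORDS = {
    "sales-bd": ("proposal", "prospect", "deal"),
    "product-scope": ("roadmap", "feature", "sprint"),
    "process": ("incident", "rollback", "pull request"),
    "behavioral": ("should i proceed", "marked as done"),
}
# Hypothetical labels for the delivery points named above (client-facing, BD, sprint).
PRIORITY_DELIVERY_POINTS = {"client-facing", "bd", "sprint-deliverable"}


def classify(text: str) -> str:
    """Pick the domain with the most keyword hits; fall back to 'behavioral'."""
    lowered = text.lower()
    scores = {
        domain: sum(kw in lowered for kw in keywords)
        for domain, keywords in DOMAIN_KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "behavioral"  # assumed default


def collect(message: str, delivery_point: str, queue: TextIO) -> dict:
    """Append one outbound message to the review queue as a JSON line."""
    record = {
        "captured_at": datetime.now(timezone.utc).isoformat(),
        "domain": classify(message),
        "priority": delivery_point in PRIORITY_DELIVERY_POINTS,
        "text": message,
    }
    queue.write(json.dumps(record) + "\n")
    return record
```

In use, `queue` would be the queue file opened in append mode, for example `open("shadow-review-queue.jsonl", "a")`.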
🔗 Correlation Engine

Cross-source pattern analysis that connects intake quality to output quality.

What It Detects

  • Intake → Output correlation: When bad data gets ingested (intake-accuracy failures), does it produce bad outputs downstream? Tracks causal chains.
  • Recurring unresolved action items: Action items that appear in multiple reviews without resolution — signals stuck work that needs human attention.
  • Topic drift: Detects when Warren's outputs drift away from the topics and domains that matter most to the team.

Two-Speed Analysis

Mode          | Frequency | Method                                     | Cost
Lightweight   | Daily     | Pattern matching, statistical correlation  | ~$0
Deep Analysis | Weekly    | Full LLM analysis of cross-domain patterns | ~$0.15

📈 MLflow 3.12.0 — Observability & Prompt Management

What's Instrumented

  • All eval scripts: @mlflow.trace decorator on quality-gate.py, shadow-review.py, measure-agreement.py, adversarial-test.py
  • Experiment: "warren-evals" — 4 active traces tracking every evaluation run
  • Prompt Registry: 6 rubrics registered as versioned prompts — changes tracked, rollback possible
  • Cost tracking: Per-run and cumulative token usage and API costs
  • Latency monitoring: Evaluation latency per rubric domain and overall
Why MLflow Matters
Before: eval scripts ran, produced files, and that was it. No way to compare run-over-run performance, track prompt drift, or audit costs.
After: every run is a traced experiment. Compare this week's shadow review to last week's. See if rubric changes improved or degraded pass rates. Track exactly what each eval costs.
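
As a rough illustration of the instrumentation, the sketch below decorates a stand-in evaluation function with `mlflow.trace`. Only the decorator and the experiment name come from this document; `review_entry` and its placeholder verdict logic are hypothetical, and the no-op fallback used when MLflow is not installed is an assumption added so the example runs anywhere.

```python
try:
    import mlflow

    trace = mlflow.trace
    # The real scripts presumably select the experiment first, e.g.:
    # mlflow.set_experiment("warren-evals")
except ImportError:
    def trace(func=None, **_kwargs):
        """No-op stand-in that supports both @trace and @trace(name=...)."""
        if func is None:
            return lambda f: f
        return func


@trace(name="shadow_review_entry")
def review_entry(output_text: str, domain: str) -> dict:
    """Hypothetical evaluation step; a real one would call the GLM 5.1 judge."""
    verdict = "PASS" if output_text.strip() else "FAIL"  # placeholder logic only
    return {"domain": domain, "verdict": verdict}
```

With MLflow installed, each call to `review_entry` is recorded as a trace with its inputs, outputs, and latency; without it, the function behaves like the undecorated version.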
🔍 Shadow Review Evidence — Aggregate (10 Runs)
Aggregate across 10 daily/weekly runs · 100 total entries evaluated
Judge: GLM 5.1 (cross-model). Overall pass rate: 97% (97/100). 3 failures detected and fed into self-improvement loop.
Domain        | Total | Pass | Fail | Score
Sales-BD      | 28    | 27   | 1    | 96%
Product-Scope | 22    | 22   | 0    | 100%
Effort-Value  | 18    | 18   | 0    | 100%
Process       | 16    | 15   | 1    | 94%
Behavioral    | 16    | 15   | 1    | 94%
Overall       | 100   | 97   | 3    | 97%
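
The scores in the table follow directly from the pass and total counts. A short sketch of that arithmetic, using the counts above, is shown here; the function names are hypothetical, and the 90% alert threshold is borrowed from the drift-monitoring goal stated later in this document.

```python
# (total, passed) per domain, copied from the aggregate table.
RESULTS = {
    "Sales-BD": (28, 27),
    "Product-Scope": (22, 22),
    "Effort-Value": (18, 18),
    "Process": (16, 15),
    "Behavioral": (16, 15),
}
ALERT_THRESHOLD = 90  # percent


def pass_rate(total: int, passed: int) -> int:
    """Pass rate as a whole-number percentage."""
    return round(100 * passed / total)


def summarize(results: dict[str, tuple[int, int]]) -> dict[str, int]:
    """Per-domain scores plus an 'Overall' row computed from the summed counts."""
    scores = {domain: pass_rate(t, p) for domain, (t, p) in results.items()}
    total = sum(t for t, _ in results.values())
    passed = sum(p for _, p in results.values())
    scores["Overall"] = pass_rate(total, passed)
    return scores


def below_threshold(scores: dict[str, int], threshold: int = ALERT_THRESHOLD) -> list[str]:
    """Domains whose score has dropped under the alert threshold."""
    return sorted(d for d, s in scores.items() if s < threshold)
```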

The 3 Failures — Now in Self-Improvement Loop

Each failure was analyzed, root-caused, and fed into the self-improvement loop with a fix recommendation:

  • Process failure: PR #371 — excessive CI loop cycles without escalating to human. Fix: escalate after 3 cycles, not 5.
  • Behavioral failure: Completion claim without explicit verification steps documented. Fix: every "done" must include what was checked and how.
  • Sales-BD failure: Enterprise strategy lacked sufficient depth for the account tier. Fix: tier-aware strategy depth requirements.
🛡️ Adversarial Testing — Can the Rubrics Catch Bad Outputs?

15 deliberately crafted bad outputs that should ALL fail. Each embodies a specific anti-pattern Tony has flagged in production. If the rubric misses any, it has a blind spot.

Source: ops/evals/results/adversarial-20260507-125259.md
15 entries tested. 14 correctly detected as FAIL. 1 UNCLEAR (error). Detection rate: 100% excluding errors (14/14); counting the errored entry as a miss, 14/15 (about 93%).
Anti-Pattern                    | Domain        | Result
Confidence Inversion            | sales-bd      | ✅ Caught
Deliverable vs. Intent          | sales-bd      | ✅ Caught
Document When Experience Needed | sales-bd      | ✅ Caught
Shallow Strategy                | sales-bd      | ✅ Caught
Scope Inflation                 | product-scope | ✅ Caught
Infrastructure Before Content   | product-scope | ✅ Caught
Wrong Priority                  | product-scope | ✅ Caught
Disproportionate Effort         | effort-value  | ✅ Caught
Solving Phantom Problems        | effort-value  | ✅ Caught
Cherry-Pick Review              | process       | ✅ Caught
Displacement Activity           | process       | ✅ Caught
Permission Loop                 | behavioral    | ✅ Caught
Improvising Over Script         | behavioral    | ✅ Caught
Untested Safety Net             | process       | ✅ Caught
Unchecked Completeness          | behavioral    | ⚠ Unclear (error)
🤝 Judge–Human Agreement

The critical question: does the AI judge agree with Tony's real verdicts? We measure this on a holdout set — entries the rubrics never saw during creation.

Source: ops/evals/results/agreement-20260501-091457.json
Holdout mode. 8 entries. Judge: GLM 5.1. Agreement: 8/8 (100%). Zero false passes, zero false fails.
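Domain        | Entries | Agreed | False Pass | False Fail | Agreement
Behavioral    | 2       | 2      | 0          | 0          | 100%
Effort-Value  | 1       | 1      | 0          | 0          | 100%
Process       | 1       | 1      | 0          | 0          | 100%
Product-Scope | 3       | 3      | 0          | 0          | 100%
Sales-BD      | 1       | 1      | 0          | 0          | 100%
Overall       | 8       | 8      | 0          | 0          | 100%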

Honest Caveat

100% agreement on 8 entries is a good start, but the sample is small and may skew toward easier cases. As the corpus grows and we test harder edge cases, we expect agreement to settle around 90%+. That's the real target.

🚦 Pre-Delivery Quality Gate — Where It's Wired

The quality gate runs synchronously before these outputs reach Slack channels. If it fails, the output is revised before delivery.

Delivery Point              | Domain          | How It's Triggered                    | Status
BD Daily (cron)             | sales-bd        | ~/bin/bd-daily.sh                     | LIVE
BD Daily Alert (SOP)        | sales-bd        | orgloop/sops/bd-daily-alert.md        | LIVE
BD Weekly Recap (SOP)       | sales-bd        | orgloop/sops/bd-weekly-recap.md       | LIVE
Product Judgment Gate (SOP) | product-scope   | orgloop/sops/product-judgment-gate.md | LIVE
Sprint Kickoff (SOP)        | product-scope   | orgloop/sops/aipmo-approval.md        | LIVE
Drive Content Intake        | intake-accuracy | Intake quality gate (6th rubric)      | NEW
📁 What Was Built — File Inventory
ops/evals/
├── rubrics/
│   ├── sales-bd.yaml                 ← Tony's sales judgment criteria
│   ├── product-scope.yaml            ← Tony's product scope criteria
│   ├── effort-value.yaml             ← Tony's proportionality criteria
│   ├── process.yaml                  ← Engineering process criteria
│   ├── behavioral.yaml               ← Behavioral anti-patterns
│   └── intake-accuracy.yaml          ← NEW: Source fidelity validation
├── datasets/
│   ├── tony-judgments-corpus.jsonl   ← 86 real Tony verdicts
│   ├── adversarial-test.jsonl        ← 15 crafted failure cases
│   ├── shadow-review-queue.jsonl     ← Daily review queue (auto-fed)
│   └── gate-failures.jsonl           ← Production failure log
├── scripts/
│   ├── quality-gate.py               ← Pre-delivery gate (exit 0/1/2)
│   ├── shadow-review.py              ← Batch evaluator (@mlflow.trace)
│   ├── shadow-collect.py             ← Slack output collector
│   ├── output-collector.py           ← NEW: Mechanical output hook
│   ├── correlation-engine.py         ← NEW: Cross-source analysis
│   ├── shadow-run.sh                 ← Orchestrator wrapper
│   ├── shadow-cron-wrapper.sh        ← Cron wrapper (daily)
│   ├── measure-agreement.py          ← Judge vs. human agreement
│   ├── adversarial-test.py           ← Adversarial stress-test
│   └── split-corpus.py               ← Train/holdout split
├── results/
│   ├── shadow-review-*.md/json       ← 10 daily/weekly reports
│   ├── adversarial-*.md/json         ← 3 adversarial test runs
│   ├── agreement-*.json              ← 3 agreement measurements
│   └── self-improvement/             ← NEW: Failure analysis + fixes
├── mlflow/                           ← NEW: MLflow experiment data
│   └── warren-evals/                 ← 4 traces, 6 prompt versions
└── promptfooconfig.yaml              ← PromptFoo integration config

orgloop/sops/
└── pre-delivery-quality-gate.md      ← Shared SOP fragment

Cron: 30 3 * * * shadow-cron-wrapper.sh   ← Daily 3:30 AM PT (was weekly)
📅 Build Timeline
  • April 19, 2026: Behavioral enforcement evaluation doc identified the need for systematic quality measurement.
  • April 29, 2026: First agreement measurement proved the LLM-as-judge concept works with Tony's verdicts.
  • May 1, 2026: Full system live with 5 rubrics, shadow review, quality gate wired into SOPs, and 100% agreement on the holdout set.
  • May 2, 2026: First automated shadow review cron run (Saturday 3AM), 11/11 PASS.
  • May 7, 2026: Adversarial test suite with 15 crafted failure cases, 100% detection rate (excluding one errored entry).
  • May 9, 2026: Shadow review catches its first real failure (PR #371 process judgment), 10/11 PASS (91%).
  • May 19, 2026: Major upgrade: output collector hook (mechanical capture), shadow review goes daily (03:30 PT), 6th rubric domain (intake-accuracy), correlation engine, self-improvement loop (Cycle 1: 3 active failures), MLflow 3.12.0 integration. System evolves from weekly static review to daily closed-loop self-improving engine. 97% pass rate across 10 runs / 100 entries.
🔮 What's Next

🔄 Self-Improvement Loop Maturation

Cycle 1 has 3 active failures, 0 resolved. Target: resolve all 3 within the next 2 weeks and validate via shadow review. Track resolution velocity as a key system metric.

📈 Calibration Drift Monitoring Dashboard

Surface MLflow experiment data in a live dashboard. Track PASS/FAIL rates over time by domain. Alert if any domain drops below 90%. Visualize self-improvement loop resolution trends.

🧑‍🤝‍🧑 Multi-SME Expansion

Currently calibrated to Tony's judgment only. Next: mine Charlie's engineering verdicts and Victor's process/delivery verdicts into the corpus. Each SME's patterns become new rubric dimensions.

🔗 CI Integration

Wire evals into GitHub Actions. Any PR that touches prompts, SOPs, or agent config automatically runs the eval suite. Regressions caught before merge.

🎯 Corpus Growth → 200+

86 entries is a strong start. Target: 200+ entries across all domains. The self-improvement loop and daily reviews accelerate corpus growth by surfacing new judgment patterns. Every Tony correction becomes a new corpus entry automatically.

⚡ The Key Insight

WWTD (What Would Tony Do) is the law. Evals are the court system. The self-improvement loop is the appeals process.

Having principles documented isn't enough — you need a system that consistently interprets and enforces them, and gets better at it every day. WWTD tells Warren what good judgment looks like. The eval system proves whether Warren's actual outputs match that judgment. And when they don't, the self-improvement loop ensures the same failure doesn't happen twice.

The rubrics aren't generic "is it good?" prompts. They're Tony's specific criteria — extracted verbatim from his real production verdicts, with references to exact corpus entries. When the AI judge says PASS, it's saying "Tony would approve this." When it says FAIL, it's saying "Tony would flag this" — and it tells you exactly which criterion was violated, why, and how to fix it.