Warren Evals — Agent Quality System (Live Evidence)

📊 Key Metrics (Live Data)

Metric	Value
Latest Shadow Review Pass Rate	91% (10/11)
Judge–Human Agreement	100% (8/8)
Adversarial Detection Rate	100% of scored cases (14/14; 1 of 15 errored and is unscored)
Tony Judgment Corpus Entries	86
Calibrated Rubric Domains	5
Gate Failures in Production	0

🎯 What This System Does

The Problem

Warren produces 100+ outputs per day: BD dailies, sprint plans, product analyses, PR reviews, client deliverables. Tony can't review everything. We needed an automated quality system that catches judgment failures. The question is not just "did it run?" but "did it make the right call?"

The Solution

A 3-level evaluation system built on Tony's actual judgment patterns. A cross-model AI judge (GLM 5.1) scores Warren's real outputs against rubrics extracted from 86 real Tony verdicts. The system runs both pre-delivery (quality gate before outputs reach clients) and post-delivery (weekly shadow review of all outputs).

🏗️ Architecture — How It's Built

Warren produces output for delivery
                  │
                  ▼
┌─────────────────────────────────────────┐
│ quality-gate.py (Pre-Delivery Gate)     │
│ Judge: GLM 5.1 (cross-model)            │
│ Matches output → domain rubric          │
│ --passthrough (never blocks on error)   │
└─────────────────┬───────────────────────┘
                  │
            ┌─────┴─────┐
            │           │
         ✅ PASS     ❌ FAIL
         Deliver     Revise + re-gate
         to Slack    (max 2 attempts)
                        │
                  ┌─────┴─────┐
                  │           │
               ✅ PASS     Still FAIL
               Deliver     Deliver + self-note

Every Saturday 3:00 AM PT: Shadow Review
┌─────────────────────────────────────────┐
│ shadow-review.py (Batch Evaluator)      │
│ Collects week's outputs from Slack      │
│ Evaluates each against domain rubric    │
│ Produces report card (JSON + Markdown)  │
└─────────────────────────────────────────┘
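
The revise-and-re-gate path in the diagram can be sketched as a small loop. This is a minimal illustration, not the code in quality-gate.py: the `judge` and `revise` callables, the `Delivery` type, and the reading of "max 2 attempts" as two revision cycles after the first check are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable

MAX_REVISIONS = 2  # assumption: "max 2 attempts" means two revise + re-gate cycles


@dataclass
class Delivery:
    text: str
    passed: bool         # final gate verdict
    revisions: int       # how many revise + re-gate cycles ran
    self_note: str = ""  # attached when the output ships despite a FAIL


def gate_and_deliver(
    output: str,
    judge: Callable[[str], bool],   # hypothetical: True means PASS
    revise: Callable[[str], str],   # hypothetical: returns a revised draft
) -> Delivery:
    """Gate once, then revise and re-gate up to MAX_REVISIONS times."""
    if judge(output):
        return Delivery(output, passed=True, revisions=0)
    for attempt in range(1, MAX_REVISIONS + 1):
        output = revise(output)
        if judge(output):
            return Delivery(output, passed=True, revisions=attempt)
    # Still failing: deliver anyway, but flag it with a self-note.
    return Delivery(
        output,
        passed=False,
        revisions=MAX_REVISIONS,
        self_note="Gate still FAIL after revisions; delivered with note.",
    )
```

The last branch mirrors the "Still FAIL → Deliver + self-note" box: the gate shapes the output but never withholds it.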

Key Design Decisions

Cross-model judging: GLM 5.1 (open-weights) judges Warren's Claude outputs — not self-eval. The agent never grades its own homework.
Tony's real verdicts = ground truth: 86 entries mined from actual production evaluations, not synthetic test cases.
Passthrough on error: API failures, timeouts, or missing keys never block delivery. Safety net, not a blocker.
Domain-matched rubrics: Each output is evaluated against the correct domain rubric (sales-bd, product-scope, etc.), not a generic "is it good?" prompt.
Binary verdicts: PASS or FAIL. No "mostly good" — this forces the rubrics to be precise.
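
Three of these decisions (domain-matched rubrics, binary verdicts, passthrough on error) fit in one small function. The sketch below is illustrative only: `call_judge` stands in for whatever client talks to the judge model, and the reason strings are invented for the example.

```python
from enum import Enum
from typing import Callable, Mapping


class Verdict(Enum):
    PASS = "PASS"
    FAIL = "FAIL"


def gate(
    output: str,
    domain: str,
    rubrics: Mapping[str, str],                 # domain name -> rubric text
    call_judge: Callable[[str, str], Verdict],  # hypothetical judge client
) -> tuple[Verdict, str]:
    """Return (verdict, reason). Any error passes through instead of blocking."""
    try:
        rubric = rubrics[domain]  # domain-matched: no generic fallback rubric
        return call_judge(output, rubric), "judged"
    except Exception as exc:      # API failure, timeout, missing key, unknown domain
        return Verdict.PASS, f"passthrough: {type(exc).__name__}"
```

Returning the reason alongside the verdict keeps a passthrough distinguishable from a judged PASS, which matters when pass rates are reported later.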

📐 The 3 Levels of Evaluation

Level 1 — Rubric Internalization

Tony's judgment patterns → machine-readable rubrics

86 real Tony verdicts mined from production. 5 domain rubrics encode his specific PASS/FAIL criteria with references to actual corpus entries. This IS Tony's judgment, codified.

Level 2 — Shadow Review (Weekly)

Retrospective evaluation of real production outputs

Every Saturday at 3:00 AM PT, GLM 5.1 evaluates the week's outputs against domain rubrics. Results: a report card with PASS/FAIL per output, failure analysis, and trend tracking.

Level 3 — Pre-Delivery Quality Gate

Synchronous rubric check BEFORE outputs ship

Wired into 5 delivery points: 4 SOPs plus the BD daily cron script. Every BD daily, BD daily alert, BD weekly recap, product judgment, and sprint kickoff is gate-checked before reaching Slack. A failure triggers revision rather than blocking delivery.

📏 The 5 Rubric Domains

Each rubric encodes Tony's specific judgment patterns from real production evaluations. The rubric content IS Tony's criteria — extracted verbatim from his actual verdicts, structured into machine-parsable YAML.

🎯 Sales & BD

Customer intent alignment, pre-writing gate, show-don't-tell, confidence calibration, strategic depth. Catches: a deliverable that answers the stated questions but misses the customer's intent, a document delivered when an experience was needed, confidence inversion.

Corpus: TJ-038, TJ-039, TJ-040, TJ-064, TJ-065, TJ-066

📦 Product Scope

Scope fidelity, effort-value ratio, value priority, context-aware judgment. Catches: multi-tenant when asked for single-customer, infrastructure before content, building for imagined scale.

Corpus: TJ-014–TJ-023, TJ-027, TJ-036

⚖️ Effort-Value

Simplest path to outcome, no phantom problems, core value first, effort proportional to stakes. Catches: disproportionate effort, derived config when explicit available, polish before validation.

Corpus: TJ-015, TJ-016, TJ-017

⚙️ Process

Engineering process quality, incident handling, CI discipline, technical correctness.

Corpus: TJ-010–TJ-013

🧠 Behavioral

No permission loops, no self-talk, use available resources, follow explicit instructions exactly, validate before claiming completeness.

Corpus: TJ-003, TJ-029, TJ-044, TJ-067, TJ-068
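
To make "machine-parsable" concrete, here is one way a loaded rubric could be represented and turned into a judge prompt. The criteria, catches, and corpus IDs are copied from the Effort-Value card above; the `Rubric` type, its field names, and the prompt wording are hypothetical and do not describe the actual YAML schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rubric:
    domain: str
    criteria: tuple[str, ...]     # what a PASS must satisfy
    catches: tuple[str, ...]      # anti-patterns that force a FAIL
    corpus_refs: tuple[str, ...]  # Tony-judgment IDs backing the rubric


EFFORT_VALUE = Rubric(
    domain="effort-value",
    criteria=(
        "Simplest path to outcome",
        "No phantom problems",
        "Core value first",
        "Effort proportional to stakes",
    ),
    catches=(
        "Disproportionate effort",
        "Derived config when explicit available",
        "Polish before validation",
    ),
    corpus_refs=("TJ-015", "TJ-016", "TJ-017"),
)


def judge_prompt(rubric: Rubric, output: str) -> str:
    """Build a binary-verdict prompt from one domain rubric (wording is illustrative)."""
    lines = [f"Domain: {rubric.domain}", "PASS only if every criterion holds:"]
    lines += [f"- {c}" for c in rubric.criteria]
    lines += ["FAIL if any of these appear:"]
    lines += [f"- {c}" for c in rubric.catches]
    lines += [
        f"Grounding entries: {', '.join(rubric.corpus_refs)}",
        "Answer PASS or FAIL and name the criterion.",
        "",
        output,
    ]
    return "\n".join(lines)
```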

🔍 Shadow Review Evidence — Latest Run (May 9, 2026)

Source: ops/evals/results/shadow-review-20260509-030534.md

Automated weekly run. 11 real production outputs evaluated. Judge: GLM 5.1 (cross-model). 10/11 PASS (91%).

Domain	Total	Pass	Fail	Score
Behavioral	1	1	0	100%
Effort-Value	3	3	0	100%
Process	2	1	1	50%
Product-Scope	2	2	0	100%
Sales-BD	3	3	0	100%
Overall	11	10	1	91%
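
The table is a straight tally of per-output verdicts. As a sketch of that arithmetic (not the code in shadow-review.py), the snippet below rebuilds the rows from (domain, verdict) pairs that match the counts above; output IDs are omitted because the report names only some of them.

```python
from collections import Counter

# (domain, verdict) pairs matching the May 9 table.
results = (
    [("behavioral", "PASS")] * 1
    + [("effort-value", "PASS")] * 3
    + [("process", "PASS"), ("process", "FAIL")]
    + [("product-scope", "PASS")] * 2
    + [("sales-bd", "PASS")] * 3
)


def report_card(results):
    """Return {domain: (total, passed, failed, score_percent)} plus an 'overall' row."""
    totals, passes = Counter(), Counter()
    for domain, verdict in results:
        totals[domain] += 1
        passes[domain] += verdict == "PASS"  # bool adds as 0 or 1
    card = {
        d: (totals[d], passes[d], totals[d] - passes[d],
            round(100 * passes[d] / totals[d]))
        for d in sorted(totals)
    }
    n, p = sum(totals.values()), sum(passes.values())
    card["overall"] = (n, p, n - p, round(100 * p / n))
    return card
```

With this input, `report_card(results)["overall"]` is `(11, 10, 1, 91)` and the process row is `(2, 1, 1, 50)`, matching the table.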

The 1 Failure — Transparency Example

SR-008 (Process): PR #371 was escalated to needs-human after 5 resolution-loop cycles without fixing CI. The judge ruled that Warren should have presented clear options for a human decision rather than continuing to loop. This is exactly the kind of judgment failure the system is designed to catch.

Sample Passes (What Good Looks Like)

SR-001 (Sales-BD): BD Daily with 7 active opportunities, 2 new. Each opportunity had specific next actions and timeline.
SR-003 (Sales-BD): Synthesized Tony's WWTD audit + 8 Gemini transcripts + Trent email into cohesive sales strategy.
SR-006 (Product-Scope): Admitted upfront what was NOT known: "I don't have access to Deloitte's cluster data" — honesty over fabrication.
SR-009 (Effort-Value): 1-line href fix for a broken empty state nav — minimal execution plan for a minimal task.

🛡️ Adversarial Testing — Can the Rubrics Catch Bad Outputs?

15 deliberately crafted bad outputs that should ALL fail. Each embodies a specific anti-pattern Tony has flagged in production. If the rubric misses any, it has a blind spot.

Source: ops/evals/results/adversarial-20260507-125259.md

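15 entries tested: 14 correctly detected as FAIL and 1 returned UNCLEAR due to an error. Detection rate: 100% of the 14 scored entries (14/15, or 93%, if the errored entry is counted as a miss). The errored case, Unchecked Completeness, produced no verdict, so that anti-pattern remains unverified in this run.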

Anti-Pattern	Domain	Result
Confidence Inversion	sales-bd	✅ Caught
Deliverable vs. Intent	sales-bd	✅ Caught
Document When Experience Needed	sales-bd	✅ Caught
Shallow Strategy	sales-bd	✅ Caught
Scope Inflation	product-scope	✅ Caught
Infrastructure Before Content	product-scope	✅ Caught
Wrong Priority	product-scope	✅ Caught
Disproportionate Effort	effort-value	✅ Caught
Solving Phantom Problems	effort-value	✅ Caught
Cherry-Pick Review	process	✅ Caught
Displacement Activity	process	✅ Caught
Permission Loop	behavioral	✅ Caught
Improvising Over Script	behavioral	✅ Caught
Untested Safety Net	process	✅ Caught
Unchecked Completeness	behavioral	⚠ Unclear (error)
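
How the errored entry is counted changes the headline number, so it helps to compute both readings. A minimal sketch, assuming each adversarial case ends in one of three outcomes; the outcome labels and function name are illustrative, not taken from adversarial-test.py.

```python
def detection_rates(outcomes):
    """outcomes: list of 'FAIL' (caught), 'PASS' (missed), or 'ERROR' (no verdict)."""
    scored = [o for o in outcomes if o != "ERROR"]
    caught = sum(o == "FAIL" for o in scored)
    return {
        "scored": len(scored),
        "caught": caught,
        "rate_excluding_errors": caught / len(scored) if scored else None,
        "rate_counting_errors_as_misses": caught / len(outcomes) if outcomes else None,
    }


# The May 7 run: 14 caught, 1 errored.
may7 = ["FAIL"] * 14 + ["ERROR"]
rates = detection_rates(may7)
```

Here `rates["rate_excluding_errors"]` is 1.0 and `rates["rate_counting_errors_as_misses"]` is 14/15, roughly 0.93. Reporting both keeps an errored case from silently reading as a catch.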

🤝 Judge–Human Agreement

The critical question: does the AI judge agree with Tony's real verdicts? We measure this on a holdout set — entries the rubrics never saw during creation.

Source: ops/evals/results/agreement-20260501-091457.json

Holdout mode. 8 entries. Judge: GLM 5.1. Agreement: 8/8 (100%). Zero false passes, zero false fails.

Domain	Entries	Agreed	Agreement
Behavioral	2	2	100%
Effort-Value	1	1	100%
Process	1	1	100%
Product-Scope	3	3	100%
Sales-BD	1	1	100%
Overall	8	8	100%
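
Agreement, false passes, and false fails are simple counts over (human, judge) verdict pairs. The sketch below shows the calculation on toy pairs; it is not the logic of measure-agreement.py, and the toy data is invented for illustration and is not the holdout set.

```python
def measure_agreement(pairs):
    """pairs: list of (human_verdict, judge_verdict), each 'PASS' or 'FAIL'."""
    agreed = sum(h == j for h, j in pairs)
    return {
        "entries": len(pairs),
        "agreed": agreed,
        "agreement": agreed / len(pairs) if pairs else None,
        # False pass: the judge passed something the human failed.
        "false_passes": sum(h == "FAIL" and j == "PASS" for h, j in pairs),
        # False fail: the judge failed something the human passed.
        "false_fails": sum(h == "PASS" and j == "FAIL" for h, j in pairs),
    }


# Toy pairs for illustration only; these are NOT the 8 holdout entries.
toy = [("FAIL", "FAIL"), ("PASS", "PASS"), ("FAIL", "PASS"), ("PASS", "PASS")]
summary = measure_agreement(toy)
```

On the toy data, `summary` reports 3 of 4 agreed with one false pass. The reported holdout run corresponds to 8 of 8 agreed with both error counts at zero.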

Honest Caveat

100% agreement on 8 entries is a good start, but the sample is small and may skew toward easier cases. As the corpus grows and we test harder edge cases, we expect agreement to settle around 90%+. That's the real target.
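
One way to put a number on "the sample is small" is a confidence interval on the agreement rate. The sketch below uses the Wilson score interval, a standard choice for small binomial samples; it is an added illustration, not part of the existing eval scripts, and the 95% level (z = 1.96) is an assumption.

```python
import math


def wilson_lower_bound(successes: int, n: int, z: float = 1.96) -> float:
    """Lower end of the Wilson score interval for a binomial proportion."""
    if n == 0:
        return 0.0
    p = successes / n
    centre = p + z * z / (2 * n)
    margin = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (centre - margin) / (1 + z * z / n)


lower_8_of_8 = wilson_lower_bound(8, 8)
```

Under these assumptions, 8 agreements out of 8 give a lower bound of about 0.68, so the data is consistent with a true agreement rate well below 100%. That supports treating 90%+ on a larger holdout as the real target.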

🚦 Pre-Delivery Quality Gate — Where It's Wired

The quality gate runs synchronously before these outputs reach Slack channels. If it fails, the output is revised before delivery.

Delivery Point	Domain	How It's Triggered	Status
BD Daily (cron)	sales-bd	~/bin/bd-daily.sh	LIVE
BD Daily Alert (SOP)	sales-bd	orgloop/sops/bd-daily-alert.md	LIVE
BD Weekly Recap (SOP)	sales-bd	orgloop/sops/bd-weekly-recap.md	LIVE
Product Judgment Gate (SOP)	product-scope	orgloop/sops/product-judgment-gate.md	LIVE
Sprint Kickoff (SOP)	product-scope	orgloop/sops/aipmo-approval.md	LIVE

📁 What Was Built — File Inventory

ops/evals/
├── rubrics/
│   ├── sales-bd.yaml                ← Tony's sales judgment criteria
│   ├── product-scope.yaml           ← Tony's product scope criteria
│   ├── effort-value.yaml            ← Tony's proportionality criteria
│   ├── process.yaml                 ← Engineering process criteria
│   └── behavioral.yaml              ← Behavioral anti-patterns
├── datasets/
│   ├── tony-judgments-corpus.jsonl  ← 86 real Tony verdicts
│   ├── adversarial-test.jsonl       ← 15 crafted failure cases
│   ├── shadow-review-queue.jsonl    ← Weekly review queue
│   └── gate-failures.jsonl          ← Production failure log
├── scripts/
│   ├── quality-gate.py              ← Pre-delivery gate (exit 0/1/2)
│   ├── shadow-review.py             ← Batch evaluator
│   ├── shadow-collect.py            ← Slack output collector
│   ├── shadow-run.sh                ← Orchestrator wrapper
│   ├── shadow-cron-wrapper.sh       ← Cron wrapper
│   ├── measure-agreement.py         ← Judge vs. human agreement
│   ├── adversarial-test.py          ← Adversarial stress-test
│   └── split-corpus.py              ← Train/holdout split
├── results/
│   ├── shadow-review-*.md/json      ← 8 weekly reports
│   ├── adversarial-*.md/json        ← 3 adversarial test runs
│   └── agreement-*.json             ← 3 agreement measurements
└── promptfooconfig.yaml             ← PromptFoo integration config

orgloop/sops/
└── pre-delivery-quality-gate.md     ← Shared SOP fragment

Cron: 0 3 * * 6 shadow-cron-wrapper.sh   ← Saturday 3AM PT

📅 Build Timeline

April 19, 2026

Behavioral enforcement evaluation doc — identified the need for systematic quality measurement

April 29, 2026

First agreement measurement — proved LLM-as-judge concept works with Tony's verdicts

May 1, 2026

Full system live — 5 rubrics written, shadow review running, quality gate wired into SOPs, 100% agreement on holdout set

May 2, 2026

First automated shadow review cron run (Saturday 3AM) — 11/11 PASS

May 7, 2026

Adversarial test suite: 15 crafted failure cases, 14 of 14 scored cases detected (1 errored)

May 9, 2026

Latest shadow review — 10/11 PASS (91%), first real failure caught and documented

🔮 What Can Still Be Done

🔄 Expand Gate Coverage

Wire quality gate into more delivery points: escalation notifications, deployment reports, client-facing status updates. Every external-facing output should pass the gate.

📈 Calibration Drift Monitoring

Track PASS/FAIL rates over time. If pass rate drops below 90%, rubrics may need recalibration. Build a dashboard that surfaces trends automatically.
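
The 90% rule is easy to automate once each run is reduced to a pass count and a total. A minimal sketch, assuming runs are available as (label, passed, total) tuples; the function name and tuple shape are hypothetical.

```python
def drift_alerts(runs, threshold=0.90):
    """runs: list of (label, passed, total). Return labels whose pass rate is below threshold."""
    return [label for label, passed, total in runs
            if total and passed / total < threshold]


# Overall results from the two shadow reviews reported in this document.
overall_runs = [("2026-05-02", 11, 11), ("2026-05-09", 10, 11)]

# The same check applied to a single domain from the May 9 run.
process_may9 = [("2026-05-09 process", 1, 2)]
```

With these inputs, `drift_alerts(overall_runs)` returns an empty list because 10/11 is about 0.91, while `drift_alerts(process_may9)` flags the process domain at 50%. Per-domain checks can surface a problem that the overall rate hides, though with only two outputs a single failure is enough to trigger it.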

🧑‍🤝‍🧑 Multi-SME Expansion

Currently calibrated to Tony's judgment only. Next: mine Charlie's engineering verdicts and Victor's process/delivery verdicts into the corpus. Each SME's patterns become new rubric dimensions.

🔗 CI Integration

Wire evals into GitHub Actions. Any PR that touches prompts, SOPs, or agent config automatically runs the eval suite. Regressions caught before merge.

📊 Gate Analytics Dashboard

Surface gate-failures.jsonl data in a live dashboard. Track: failure rate by domain, most common failure patterns, revision success rate, time-to-pass trends.

🎯 Corpus Growth

86 entries is a strong start. Target: 200+ entries across all domains. More data means more precise rubrics and better judgment calibration. Every Tony correction becomes a candidate corpus entry, promoted after human review.

⚡ The Key Insight

WWTD (What Would Tony Do) is the law. Evals are the court system.

Having principles documented isn't enough — you need a system that consistently interprets and enforces them. WWTD tells Warren what good judgment looks like. The eval system proves whether Warren's actual outputs match that judgment.

The rubrics aren't generic "is it good?" prompts. They're Tony's specific criteria — extracted verbatim from his real production verdicts, with references to exact corpus entries. When the AI judge says PASS, it's saying "Tony would approve this." When it says FAIL, it's saying "Tony would flag this" — and it tells you exactly which criterion was violated.

📋 PROPOSAL — Awaiting Tony Approval (May 13, 2026)

🔄 Daily Self-Improving Eval Engine

Upgrade the current weekly shadow review into a daily automated loop that continuously improves Warren's judgment calibration. Every day the system gets smarter.

🎯 What Changes

	Today (Weekly)	Proposed (Daily)
Shadow Review	Saturdays 3AM — static queue of 11 entries	Every day 3AM — auto-collected from previous 24h
Output Collection	Manual — requires running shadow-collect.py with an export	Automatic — Slack API pulls Warren's outputs from all channels
Feedback Loop	Report saved to disk. No automatic action.	Failures feed back: logged, pattern-tracked, rubric refinement triggered
Corpus Growth	Manual — human adds entries to corpus	Semi-automatic — every Tony correction becomes a candidate corpus entry
Quality Gate	Pre-delivery on 5 delivery points (already live)	Same + trend data from daily reviews informs gate sensitivity

🔁 The Self-Improving Flywheel

Daily Auto-Collect (Slack API → last 24h of Warren outputs)
        │
        ▼
Shadow Review (GLM 5.1 judges each output against domain rubric)
        │
        ▼
┌───────┴───────┐
│               │
PASS            FAIL
Trend data      │
                ▼
          Failure Analysis
│               │
│               ▼
│         Pattern Detection → Recurring failures signal rubric gaps
│               │
▼               ▼
Corpus grows    Rubrics refined
│               │
└───────┬───────┘
        ▼
Better calibration tomorrow
        │
        ▼
Fewer failures → More trust → More autonomy

⚙️ Technical Implementation

Auto-collect script: New shadow-auto-collect.sh — calls Slack API for each monitored channel, filters Warren's messages, classifies by domain, writes daily queue.
Cron change: 0 3 * * * (daily) instead of 0 3 * * 6 (Saturdays only).
Failure feedback: Every FAIL appended to gate-failures.jsonl with structured metadata. Weekly rollup flags patterns.
Corpus auto-ingestion: Tony corrections staged as candidates. Human review required before promotion — no auto-corruption of ground truth.
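
The failure-feedback step could look roughly like this: append each FAIL as one JSON line, then count repeated (domain, criterion) pairs in the rollup. This is a sketch under stated assumptions. The record fields (`id`, `domain`, `criterion`, `date`) and both function names are hypothetical, since the proposal does not define the metadata schema.

```python
import json
from collections import Counter
from pathlib import Path


def log_failure(path: Path, output_id: str, domain: str,
                criterion: str, date: str) -> None:
    """Append one FAIL as a JSON line (field names are illustrative)."""
    record = {"id": output_id, "domain": domain,
              "criterion": criterion, "date": date}
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")


def recurring_patterns(path: Path, min_count: int = 2):
    """Return [((domain, criterion), count)] for pairs seen at least min_count times."""
    counts = Counter()
    with path.open(encoding="utf-8") as fh:
        for line in fh:
            if line.strip():  # tolerate blank lines in the log
                rec = json.loads(line)
                counts[(rec["domain"], rec["criterion"])] += 1
    return [(key, n) for key, n in counts.most_common() if n >= min_count]
```

An append-only JSONL log keeps each failure independent, so a crashed run cannot corrupt earlier records, and the rollup can be recomputed from scratch at any time.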

💡 Why This Matters

Running the review daily, collecting outputs automatically, and feeding failures back into the rubrics turns the weekly report into a system that gets measurably better every day.

Week 1: 86 corpus entries, ~10 outputs/day evaluated
Month 1: ~130 corpus entries, rubrics refined from daily failure patterns
Month 3: 200+ corpus entries, multi-SME, measurable trend line

This is the AIPMO differentiator. Not "our agent follows instructions" — but "our agent has a provable, continuously-improving quality system calibrated to your SMEs."

✋ Decision Requested

Approve upgrading the eval system from weekly static review to daily auto-collected self-improving engine.

Scope: Auto-collect + daily cron + failure feedback loop + corpus candidate ingestion
Timeline: Core operational within 1 week. Trend dashboard within 2 weeks.
Cost: ~$0.10–0.30/day (GLM 5.1 via Together API)
Risk: Zero. Passthrough design — if anything fails, nothing breaks.
