🟢 System Live — Running in Production

Warren Evals — Agent Quality System

Evidence of the 3-level evaluation system that ensures Warren's outputs meet Tony-calibrated quality standards. All components live and running on production work.

📊 Key Metrics (Live Data)
91%
Latest Shadow Review
Pass Rate (10/11)
100%
Judge–Human
Agreement (8/8)
100%
Adversarial Detection
Rate (14/15)
86
Tony Judgment
Corpus Entries
5
Calibrated Rubric
Domains
0
Gate Failures
in Production
🎯 What This System Does

The Problem

Warren produces 100+ outputs per day — BD dailys, sprint plans, product analyses, PR reviews, client deliverables. Tony can't review everything. We needed an automated quality system that catches judgment failures — not just "did it run?" but "did it make the right call?"

The Solution

A 3-level evaluation system built on Tony's actual judgment patterns. A cross-model AI judge (GLM 5.1) scores Warren's real outputs against rubrics extracted from 86 real Tony verdicts. The system runs both pre-delivery (quality gate before outputs reach clients) and post-delivery (weekly shadow review of all outputs).

🏗️ Architecture — How It's Built
Warren produces output for delivery │ ▼ ┌─────────────────────────────────────────┐ │ quality-gate.py (Pre-Delivery Gate) │ │ Judge: GLM 5.1 (cross-model) │ │ Matches output → domain rubric │ │ --passthrough (never blocks on error) │ └─────────────────┬───────────────────────┘ │ ┌─────┴─────┐ │ │ ✅ PASS ❌ FAIL Deliver Revise + re-gate to Slack (max 2 attempts) │ ┌─────┴─────┐ │ │ ✅ PASS Still FAIL Deliver Deliver + self-note Every Saturday 3:00 AM PT — Shadow Review ┌─────────────────────────────────────────┐ │ shadow-review.py (Batch Evaluator) │ │ Collects week's outputs from Slack │ │ Evaluates each against domain rubric │ │ Produces report card (JSON + Markdown) │ └─────────────────────────────────────────┘

Key Design Decisions

  • Cross-model judging: GLM 5.1 (open-weights) judges Warren's Claude outputs — not self-eval. The agent never grades its own homework.
  • Tony's real verdicts = ground truth: 86 entries mined from actual production evaluations, not synthetic test cases.
  • Passthrough on error: API failures, timeouts, or missing keys never block delivery. Safety net, not a blocker.
  • Domain-matched rubrics: Each output is evaluated against the correct domain rubric (sales-bd, product-scope, etc.), not a generic "is it good?" prompt.
  • Binary verdicts: PASS or FAIL. No "mostly good" — this forces the rubrics to be precise.
📐 The 3 Levels of Evaluation
Level 1 — Rubric Internalization
Tony's judgment patterns → machine-readable rubrics
86 real Tony verdicts mined from production. 5 domain rubrics encode his specific PASS/FAIL criteria with references to actual corpus entries. This IS Tony's judgment, codified.
Level 2 — Shadow Review (Weekly)
Retrospective evaluation of real production outputs
Every Saturday at 3:00 AM, GLM 5.1 evaluates the week's outputs against domain rubrics. Results: a report card with PASS/FAIL per output, failure analysis, and trend tracking.
Level 3 — Pre-Delivery Quality Gate
Synchronous rubric check BEFORE outputs ship
Wired into 5 delivery SOPs. Every BD daily, BD weekly recap, product judgment, and sprint kickoff is gate-checked before reaching Slack. Failures trigger revision, not blocking.
📏 The 5 Rubric Domains

Each rubric encodes Tony's specific judgment patterns from real production evaluations. The rubric content IS Tony's criteria — extracted verbatim from his actual verdicts, structured into machine-parsable YAML.

🎯 Sales & BD

Customer intent alignment, pre-writing gate, show-don't-tell, confidence calibration, strategic depth. Catches: deliverable serves stated questions not intent, document when experience needed, confidence inversion.
Corpus: TJ-038, TJ-039, TJ-040, TJ-064, TJ-065, TJ-066

📦 Product Scope

Scope fidelity, effort-value ratio, value priority, context-aware judgment. Catches: multi-tenant when asked for single-customer, infrastructure before content, building for imagined scale.
Corpus: TJ-014–TJ-023, TJ-027, TJ-036

⚖️ Effort-Value

Simplest path to outcome, no phantom problems, core value first, effort proportional to stakes. Catches: disproportionate effort, derived config when explicit available, polish before validation.
Corpus: TJ-015, TJ-016, TJ-017

⚙️ Process

Engineering process quality, incident handling, CI discipline, technical correctness.
Corpus: TJ-010–TJ-013

🧠 Behavioral

No permission loops, no self-talk, use available resources, follow explicit instructions exactly, validate before claiming completeness.
Corpus: TJ-003, TJ-029, TJ-044, TJ-067, TJ-068
🔍 Shadow Review Evidence — Latest Run (May 9, 2026)
Source: ops/evals/results/shadow-review-20260509-030534.md
Automated weekly run. 11 real production outputs evaluated. Judge: GLM 5.1 (cross-model). 10/11 PASS (91%).
DomainTotalPassFailScore
Behavioral110100%
Effort-Value330100%
Process21150%
Product-Scope220100%
Sales-BD330100%
Overall1110191%

The 1 Failure — Transparency Example

SR-008 (Process): PR #371 escalated to needs-human after 5 resolution-loop cycles without fixing CI. Judge ruled: should have presented clear options for human decision rather than continuing to loop. This is exactly the kind of judgment failure the system is designed to catch.

Sample Passes (What Good Looks Like)

  • SR-001 (Sales-BD): BD Daily with 7 active opportunities, 2 new. Each opportunity had specific next actions and timeline.
  • SR-003 (Sales-BD): Synthesized Tony's WWTD audit + 8 Gemini transcripts + Trent email into cohesive sales strategy.
  • SR-006 (Product-Scope): Admitted upfront what was NOT known: "I don't have access to Deloitte's cluster data" — honesty over fabrication.
  • SR-009 (Effort-Value): 1-line href fix for a broken empty state nav — minimal execution plan for a minimal task.
🛡️ Adversarial Testing — Can the Rubrics Catch Bad Outputs?

15 deliberately crafted bad outputs that should ALL fail. Each embodies a specific anti-pattern Tony has flagged in production. If the rubric misses any, it has a blind spot.

Source: ops/evals/results/adversarial-20260507-125259.md
15 entries tested. 14 correctly detected as FAIL. 1 UNCLEAR (error). Detection rate: 100% (excluding errors).
Confidence Inversion
sales-bd
✅ Caught
Deliverable vs. Intent
sales-bd
✅ Caught
Document When Experience Needed
sales-bd
✅ Caught
Shallow Strategy
sales-bd
✅ Caught
Scope Inflation
product-scope
✅ Caught
Infrastructure Before Content
product-scope
✅ Caught
Wrong Priority
product-scope
✅ Caught
Disproportionate Effort
effort-value
✅ Caught
Solving Phantom Problems
effort-value
✅ Caught
Cherry-Pick Review
process
✅ Caught
Displacement Activity
process
✅ Caught
Permission Loop
behavioral
✅ Caught
Improvising Over Script
behavioral
✅ Caught
Untested Safety Net
process
✅ Caught
Unchecked Completeness
behavioral
⚠ Unclear (error)
🤝 Judge–Human Agreement

The critical question: does the AI judge agree with Tony's real verdicts? We measure this on a holdout set — entries the rubrics never saw during creation.

Source: ops/evals/results/agreement-20260501-091457.json
Holdout mode. 8 entries. Judge: GLM 5.1. Agreement: 8/8 (100%). Zero false passes, zero false fails.
DomainEntriesAgreedFalse PassFalse FailAgreement
Behavioral2200100%
Effort-Value1100100%
Process1100100%
Product-Scope3300100%
Sales-BD1100100%
Overall8800100%

Honest Caveat

100% agreement on 8 entries is a good start, but the sample is small and may skew toward easier cases. As the corpus grows and we test harder edge cases, we expect agreement to settle around 90%+. That's the real target.

🚦 Pre-Delivery Quality Gate — Where It's Wired

The quality gate runs synchronously before these outputs reach Slack channels. If it fails, the output is revised before delivery.

Delivery PointDomainHow It's TriggeredStatus
BD Daily (cron)sales-bd~/bin/bd-daily.shLIVE
BD Daily Alert (SOP)sales-bdorgloop/sops/bd-daily-alert.mdLIVE
BD Weekly Recap (SOP)sales-bdorgloop/sops/bd-weekly-recap.mdLIVE
Product Judgment Gate (SOP)product-scopeorgloop/sops/product-judgment-gate.mdLIVE
Sprint Kickoff (SOP)product-scopeorgloop/sops/aipmo-approval.mdLIVE
📁 What Was Built — File Inventory
ops/evals/ ├── rubrics/ │ ├── sales-bd.yaml ← Tony's sales judgment criteria │ ├── product-scope.yaml ← Tony's product scope criteria │ ├── effort-value.yaml ← Tony's proportionality criteria │ ├── process.yaml ← Engineering process criteria │ └── behavioral.yaml ← Behavioral anti-patterns ├── datasets/ │ ├── tony-judgments-corpus.jsonl ← 86 real Tony verdicts │ ├── adversarial-test.jsonl ← 15 crafted failure cases │ ├── shadow-review-queue.jsonl ← Weekly review queue │ └── gate-failures.jsonl ← Production failure log ├── scripts/ │ ├── quality-gate.py ← Pre-delivery gate (exit 0/1/2) │ ├── shadow-review.py ← Batch evaluator │ ├── shadow-collect.py ← Slack output collector │ ├── shadow-run.sh ← Orchestrator wrapper │ ├── shadow-cron-wrapper.sh ← Cron wrapper │ ├── measure-agreement.py ← Judge vs. human agreement │ ├── adversarial-test.py ← Adversarial stress-test │ └── split-corpus.py ← Train/holdout split ├── results/ │ ├── shadow-review-*.md/json ← 8 weekly reports │ ├── adversarial-*.md/json ← 3 adversarial test runs │ └── agreement-*.json ← 3 agreement measurements └── promptfooconfig.yaml ← PromptFoo integration config orgloop/sops/ └── pre-delivery-quality-gate.md ← Shared SOP fragment Cron: 0 3 * * 6 shadow-cron-wrapper.sh ← Saturday 3AM PT
📅 Build Timeline
April 19, 2026
Behavioral enforcement evaluation doc — identified the need for systematic quality measurement
April 29, 2026
First agreement measurement — proved LLM-as-judge concept works with Tony's verdicts
May 1, 2026
Full system live — 5 rubrics written, shadow review running, quality gate wired into SOPs, 100% agreement on holdout set
May 2, 2026
First automated shadow review cron run (Saturday 3AM) — 11/11 PASS
May 7, 2026
Adversarial test suite — 15 crafted failure cases, 100% detection rate
May 9, 2026
Latest shadow review — 10/11 PASS (91%), first real failure caught and documented
🔮 What Can Still Be Done

🔄 Expand Gate Coverage

Wire quality gate into more delivery points: escalation notifications, deployment reports, client-facing status updates. Every external-facing output should pass the gate.

📈 Calibration Drift Monitoring

Track PASS/FAIL rates over time. If pass rate drops below 90%, rubrics may need recalibration. Build a dashboard that surfaces trends automatically.

🧑‍🤝‍🧑 Multi-SME Expansion

Currently calibrated to Tony's judgment only. Next: mine Charlie's engineering verdicts and Victor's process/delivery verdicts into the corpus. Each SME's patterns become new rubric dimensions.

🔗 CI Integration

Wire evals into GitHub Actions. Any PR that touches prompts, SOPs, or agent config automatically runs the eval suite. Regressions caught before merge.

📊 Gate Analytics Dashboard

Surface gate-failures.jsonl data in a live dashboard. Track: failure rate by domain, most common failure patterns, revision success rate, time-to-pass trends.

🎯 Corpus Growth

86 entries is a strong start. Target: 200+ entries across all domains. More data = more precise rubrics = better judgment calibration. Every Tony correction becomes a new corpus entry automatically.

⚡ The Key Insight

WWTD (What Would Tony Do) is the law. Evals are the court system.

Having principles documented isn't enough — you need a system that consistently interprets and enforces them. WWTD tells Warren what good judgment looks like. The eval system proves whether Warren's actual outputs match that judgment.

The rubrics aren't generic "is it good?" prompts. They're Tony's specific criteria — extracted verbatim from his real production verdicts, with references to exact corpus entries. When the AI judge says PASS, it's saying "Tony would approve this." When it says FAIL, it's saying "Tony would flag this" — and it tells you exactly which criterion was violated.

📋 PROPOSAL — Awaiting Tony Approval (May 13, 2026)
🔄 Daily Self-Improving Eval Engine

Upgrade the current weekly shadow review into a daily automated loop that continuously improves Warren's judgment calibration. Every day the system gets smarter.

🎯 What Changes

Today (Weekly)Proposed (Daily)
Shadow ReviewSaturdays 3AM — static queue of 11 entriesEvery day 3AM — auto-collected from previous 24h
Output CollectionManual — requires running shadow-collect.py with an exportAutomatic — Slack API pulls Warren's outputs from all channels
Feedback LoopReport saved to disk. No automatic action.Failures feed back: logged, pattern-tracked, rubric refinement triggered
Corpus GrowthManual — human adds entries to corpusSemi-automatic — every Tony correction becomes a candidate corpus entry
Quality GatePre-delivery on 5 SOPs (already live)Same + trend data from daily reviews informs gate sensitivity

🔁 The Self-Improving Flywheel

Daily Auto-Collect (Slack API → last 24h of Warren outputs) │ ▼ Shadow Review (GLM 5.1 judges each output against domain rubric) │ ▼ ┌───────┴───────┐ │ │ PASS FAIL Trend data │ ▼ Failure Analysis │ │ │ ▼ │ Pattern Detection → Recurring failures signal rubric gaps │ │ ▼ ▼ Corpus grows Rubrics refined │ │ └───┬───┘ ▼ Better calibration tomorrow │ ▼ Fewer failures → More trust → More autonomy

⚙️ Technical Implementation

  • Auto-collect script: New shadow-auto-collect.sh — calls Slack API conversations.history for each monitored channel, filters to Warren's messages (bot ID U0AK187LSCF), skips noise (healthchecks, acks), classifies by domain using existing shadow-collect.py logic, writes daily queue to shadow-review-queue-YYYYMMDD.jsonl
  • Cron change: 0 3 * * * (daily) instead of 0 3 * * 6 (Saturdays only). Same cron wrapper, same shadow-review.py, different input frequency.
  • Failure feedback: Every FAIL entry gets appended to gate-failures.jsonl with structured metadata: domain, fail criterion, output summary, date. Weekly rollup script flags patterns (e.g., "3 effort-value failures this week → review rubric").
  • Corpus auto-ingestion: When Tony corrects Warren in Slack (thread reply with correction signal), the correction + original output are staged as candidate corpus entries in corpus-candidates.jsonl. Human review required before promotion to the 86-entry corpus — no auto-corruption of ground truth.

💡 Why This Matters

Most AI agent teams rely on "vibes" — does the output feel right? We already have something better: rubrics calibrated to Tony's real verdicts with 100% judge-human agreement.

But the current system is static — it evaluates the same queue weekly. Making it daily + auto-collecting + feedback-looping turns it into something fundamentally different: a system that gets measurably better every day.

The compounding effect:

  • Week 1: 86 corpus entries, 5 rubrics, ~10 outputs/day evaluated
  • Month 1: ~130 corpus entries (from Tony corrections), rubrics refined from daily failure patterns
  • Month 3: 200+ corpus entries, multi-SME rubrics (Tony + Charlie + Victor), measurable quality trend line

This is the AIPMO differentiator. Not just "our agent follows instructions" — but "our agent has a provable, auditable, continuously-improving quality system calibrated to your SMEs' actual judgment."

📐 Effort Estimate

ComponentEffortDependency
Auto-collect from Slack API2–3 hoursNone — Slack bot token already available
Cron → daily5 minutesAuto-collect must exist first
Failure feedback loop2–4 hoursDaily reviews producing data
Corpus auto-ingestion (candidates)3–4 hoursTony correction detection heuristic
Trend dashboard (bonus)4–6 hours≥7 days of daily data
Total (core)7–12 hoursCan ship incrementally

✋ Decision Requested

Approve upgrading the eval system from weekly static review to daily auto-collected self-improving engine.

  • Scope: Auto-collect + daily cron + failure feedback loop + corpus candidate ingestion
  • Timeline: Core operational within 1 week. Trend dashboard within 2 weeks.
  • Cost: GLM 5.1 via Together API — ~$0.10–0.30/day at current output volume (10-15 outputs/day)
  • Risk: Low. Passthrough design — if collection or review fails, nothing breaks. Quality gate (pre-delivery) continues independently.
  • No downside: Worst case, we have more data. Best case, Warren's judgment measurably improves every week.