Evidence of the 3-level evaluation system that ensures Warren's outputs meet Tony-calibrated quality standards. All components are live and running on production work.
Warren produces 100+ outputs per day — BD dailies, sprint plans, product analyses, PR reviews, client deliverables. Tony can't review everything. We needed an automated quality system that catches judgment failures — not just "did it run?" but "did it make the right call?"
A 3-level evaluation system built on Tony's actual judgment patterns. A cross-model AI judge (GLM 5.1) scores Warren's real outputs against rubrics extracted from 86 real Tony verdicts. The system runs both pre-delivery (quality gate before outputs reach clients) and post-delivery (weekly shadow review of all outputs).
Each rubric encodes Tony's specific judgment patterns from real production evaluations. The rubric content IS Tony's criteria — extracted verbatim from his actual verdicts, structured into machine-parsable YAML.
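For illustration, a minimal sketch of what one rubric entry might look like and how it stays machine-parsable. The field names and corpus IDs here are assumptions, not the actual schema:

```python
# Illustrative sketch of one rubric entry; field names and corpus IDs
# are assumptions, not the real schema. yaml.safe_load demonstrates the
# machine-parsable property.
import yaml

RUBRIC_SKETCH = """
domain: sales-bd
criteria:
  - id: sbd-01
    pattern: "Lead with the recommended action, not a context dump"
    source_verdicts: [corpus-041, corpus-067]  # the Tony verdicts it was extracted from
    fail_if: "the ask is buried below background detail"
"""

rubric = yaml.safe_load(RUBRIC_SKETCH)
print(rubric["criteria"][0]["id"])  # -> sbd-01
```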
| Domain | Total | Pass | Fail | Pass Rate |
|---|---|---|---|---|
| Behavioral | 1 | 1 | 0 | 100% |
| Effort-Value | 3 | 3 | 0 | 100% |
| Process | 2 | 1 | 1 | 50% |
| Product-Scope | 2 | 2 | 0 | 100% |
| Sales-BD | 3 | 3 | 0 | 100% |
| Overall | 11 | 10 | 1 | 91% |
SR-008 (Process): PR #371 escalated to needs-human after 5 resolution-loop cycles without fixing CI. Judge ruled: should have presented clear options for human decision rather than continuing to loop. This is exactly the kind of judgment failure the system is designed to catch.
15 deliberately crafted bad outputs that should ALL fail. Each embodies a specific anti-pattern Tony has flagged in production. If the rubric misses any, it has a blind spot.
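A sketch of how that regression check can be expressed, assuming the bad outputs live as JSON files and `judge()` wraps the real GLM call (both assumptions):

```python
# Sketch of the anti-pattern regression check: every deliberately bad
# output must FAIL its domain rubric. The file layout and the judge()
# callable are assumptions.
import json
import pathlib

def find_blind_spots(judge, root="evals/anti-patterns"):
    blind_spots = []
    for path in sorted(pathlib.Path(root).glob("*.json")):
        case = json.loads(path.read_text())
        verdict = judge(case["domain"], case["output"])
        if verdict == "PASS":  # a PASS here means the rubric has a blind spot
            blind_spots.append((path.name, case["anti_pattern"]))
    return blind_spots  # empty list means all 15 failed as intended
```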
The critical question: does the AI judge agree with Tony's real verdicts? We measure this on a holdout set — entries the rubrics never saw during creation.
| Domain | Entries | Agreed | False Pass | False Fail | Agreement |
|---|---|---|---|---|---|
| Behavioral | 2 | 2 | 0 | 0 | 100% |
| Effort-Value | 1 | 1 | 0 | 0 | 100% |
| Process | 1 | 1 | 0 | 0 | 100% |
| Product-Scope | 3 | 3 | 0 | 0 | 100% |
| Sales-BD | 1 | 1 | 0 | 0 | 100% |
| Overall | 8 | 8 | 0 | 0 | 100% |
100% agreement on 8 entries is a good start, but the sample is small and may skew toward easier cases. As the corpus grows and we test harder edge cases, we expect agreement to settle around 90%+. That's the real target.
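For reference, a minimal sketch of how the agreement numbers above can be computed, assuming each holdout entry records Tony's original verdict (the entry fields are illustrative, not the real corpus schema):

```python
# Sketch of the holdout agreement measure: compare judge verdicts to
# Tony's recorded verdicts on entries the rubrics never saw.
def agreement_stats(holdout_entries, judge):
    agreed = false_pass = false_fail = 0
    for entry in holdout_entries:
        judge_verdict = judge(entry["domain"], entry["output"])
        if judge_verdict == entry["tony_verdict"]:
            agreed += 1
        elif judge_verdict == "PASS":   # judge passed what Tony failed
            false_pass += 1
        else:                           # judge failed what Tony passed
            false_fail += 1
    return {
        "agreement": agreed / len(holdout_entries),
        "false_pass": false_pass,
        "false_fail": false_fail,
    }
```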
The quality gate runs synchronously before these outputs reach Slack channels; if a check fails, the output is revised before delivery. A sketch of the gate loop follows the table.
| Delivery Point | Domain | How It's Triggered | Status |
|---|---|---|---|
| BD Daily (cron) | sales-bd | ~/bin/bd-daily.sh | LIVE |
| BD Daily Alert (SOP) | sales-bd | orgloop/sops/bd-daily-alert.md | LIVE |
| BD Weekly Recap (SOP) | sales-bd | orgloop/sops/bd-weekly-recap.md | LIVE |
| Product Judgment Gate (SOP) | product-scope | orgloop/sops/product-judgment-gate.md | LIVE |
| Sprint Kickoff (SOP) | product-scope | orgloop/sops/aipmo-approval.md | LIVE |
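A minimal sketch of the synchronous gate loop described above, assuming `judge` and `revise` callables; the names, retry budget, and escalation path are illustrative, not the live wiring. Note it escalates after the revision budget instead of looping, which is the SR-008 lesson:

```python
# Hedged sketch of the pre-delivery quality gate. judge(), revise(), and
# the verdict fields are assumptions; gate-failures.jsonl matches the
# failure log named elsewhere in this doc.
import json
import time

class NeedsHumanReview(Exception):
    pass

def log_gate_failure(domain, verdict, path="gate-failures.jsonl"):
    with open(path, "a") as f:  # structured record for trend tracking
        f.write(json.dumps({"domain": domain, "ts": time.time(), **verdict}) + "\n")

def quality_gate(domain, draft, judge, revise, max_revisions=2):
    for attempt in range(max_revisions + 1):
        verdict = judge(domain, draft)
        if verdict["result"] == "PASS":
            return draft                      # safe to deliver to Slack
        log_gate_failure(domain, verdict)
        if attempt < max_revisions:
            draft = revise(draft, verdict["failed_criterion"])
    raise NeedsHumanReview(f"{domain}: still failing after {max_revisions} revisions")
```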
Wire the quality gate into more delivery points: escalation notifications, deployment reports, client-facing status updates. Every external-facing output should pass the gate.
Track PASS/FAIL rates over time. If the pass rate drops below 90%, rubrics may need recalibration. Build a dashboard that surfaces trends automatically.
Currently calibrated to Tony's judgment only. Next: mine Charlie's engineering verdicts and Victor's process/delivery verdicts into the corpus. Each SME's patterns become new rubric dimensions.
Wire evals into GitHub Actions. Any PR that touches prompts, SOPs, or agent config automatically runs the eval suite. Regressions caught before merge.
Surface gate-failures.jsonl data in a live dashboard. Track: failure rate by domain, most common failure patterns, revision success rate, time-to-pass trends.
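A small rollup sketch over gate-failures.jsonl, assuming each record carries `domain` and `failed_criterion` fields (an assumption about the log schema):

```python
# Sketch of a trend rollup over gate-failures.jsonl; record fields are
# assumptions. Output feeds the dashboard: failure rate by domain and
# the most common failure patterns.
import json
from collections import Counter

def rollup(path="gate-failures.jsonl"):
    by_domain, by_criterion = Counter(), Counter()
    with open(path) as f:
        for line in f:
            record = json.loads(line)
            by_domain[record["domain"]] += 1
            by_criterion[record["failed_criterion"]] += 1
    return by_domain.most_common(), by_criterion.most_common(5)
```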
86 entries is a strong start. Target: 200+ entries across all domains. More data = more precise rubrics = better judgment calibration. Every Tony correction becomes a new corpus entry automatically.
WWTD (What Would Tony Do) is the law. Evals are the court system.
Having principles documented isn't enough — you need a system that consistently interprets and enforces them. WWTD tells Warren what good judgment looks like. The eval system proves whether Warren's actual outputs match that judgment.
The rubrics aren't generic "is it good?" prompts. They're Tony's specific criteria — extracted verbatim from his real production verdicts, with references to exact corpus entries. When the AI judge says PASS, it's saying "Tony would approve this." When it says FAIL, it's saying "Tony would flag this" — and it tells you exactly which criterion was violated.
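One way to make that concrete: the verdict is not a bare PASS/FAIL but a per-criterion trace. A minimal sketch of that shape, with field names as assumptions:

```python
# Sketch of the verdict shape this implies: per-criterion results that
# trace back to corpus entries, so a FAIL names the exact criterion.
# Field names are assumptions, not the live data model.
from dataclasses import dataclass, field

@dataclass
class CriterionResult:
    criterion_id: str      # e.g. "sbd-01" in the rubric sketch above
    source_verdicts: list  # corpus entries the criterion was extracted from
    passed: bool
    rationale: str         # judge's one-line justification

@dataclass
class Verdict:
    result: str                                   # "PASS" only if every criterion passed
    criteria: list = field(default_factory=list)  # list of CriterionResult
```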
Upgrade the current weekly shadow review into a daily automated loop that continuously improves Warren's judgment calibration. Every day the system gets smarter.
| Component | Today (Weekly) | Proposed (Daily) |
|---|---|---|
| Shadow Review | Saturdays 3AM — static queue of 11 entries | Every day 3AM — auto-collected from previous 24h |
| Output Collection | Manual — requires running shadow-collect.py with an export | Automatic — Slack API pulls Warren's outputs from all channels |
| Feedback Loop | Report saved to disk. No automatic action. | Failures feed back: logged, pattern-tracked, rubric refinement triggered |
| Corpus Growth | Manual — human adds entries to corpus | Semi-automatic — every Tony correction becomes a candidate corpus entry |
| Quality Gate | Pre-delivery on 5 SOPs (already live) | Same + trend data from daily reviews informs gate sensitivity |
- `shadow-auto-collect.sh` — calls the Slack API `conversations.history` for each monitored channel, filters to Warren's messages (bot ID U0AK187LSCF), skips noise (healthchecks, acks), classifies by domain using existing `shadow-collect.py` logic, and writes the daily queue to `shadow-review-queue-YYYYMMDD.jsonl` (sketched in Python below).
- Cron becomes `0 3 * * *` (daily) instead of `0 3 * * 6` (Saturdays only). Same cron wrapper, same `shadow-review.py`, different input frequency.
- Failures are logged to `gate-failures.jsonl` with structured metadata: domain, fail criterion, output summary, date. A weekly rollup script flags patterns (e.g., "3 effort-value failures this week → review rubric").
- Tony corrections land in `corpus-candidates.jsonl`. Human review is required before promotion to the 86-entry corpus — no auto-corruption of ground truth.
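A hedged Python sketch of the auto-collect step (the proposed script itself is shell). It uses the real Slack Web API method `conversations.history` via `slack_sdk`; the channel list, noise filter, and output fields are illustrative, and cursor pagination is omitted for brevity:

```python
# Hedged sketch of shadow-auto-collect: pull the last 24h of Warren's
# messages per channel and write the daily queue. NOISE_MARKERS and the
# output record shape are assumptions.
import json
import time
from slack_sdk import WebClient

WARREN_ID = "U0AK187LSCF"
NOISE_MARKERS = ("HEALTHCHECK", "ack")  # assumption: what counts as noise

def collect_daily_queue(token, channels, out_path):
    client = WebClient(token=token)
    oldest = str(time.time() - 24 * 3600)  # previous 24 hours
    with open(out_path, "w") as out:
        for channel in channels:
            resp = client.conversations_history(channel=channel, oldest=oldest)
            for msg in resp["messages"]:
                if msg.get("user") != WARREN_ID:  # keep only Warren's messages
                    continue
                text = msg.get("text", "")
                if any(marker in text for marker in NOISE_MARKERS):
                    continue  # skip healthchecks/acks
                # domain classification is delegated to shadow-collect.py logic
                out.write(json.dumps({"channel": channel, "ts": msg["ts"],
                                      "text": text}) + "\n")
```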
Most AI agent teams rely on "vibes" — does the output feel right? We already have something better: rubrics calibrated to Tony's real verdicts with 100% judge-human agreement. But the current system is static — it evaluates the same queue weekly. Making it daily, auto-collected, and feedback-looped turns it into something fundamentally different: a system that gets measurably better every day.
The compounding effect is the AIPMO differentiator: not just "our agent follows instructions," but "our agent has a provable, auditable, continuously improving quality system calibrated to your SMEs' actual judgment."
| Component | Effort | Dependency |
|---|---|---|
| Auto-collect from Slack API | 2–3 hours | None — Slack bot token already available |
| Cron → daily | 5 minutes | Auto-collect must exist first |
| Failure feedback loop | 2–4 hours | Daily reviews producing data |
| Corpus auto-ingestion (candidates) | 3–4 hours | Tony correction detection heuristic |
| Trend dashboard (bonus) | 4–6 hours | ≥7 days of daily data |
| Total (core) | 7–12 hours | Can ship incrementally |
Approve upgrading the eval system from a weekly static review to a daily, auto-collected, self-improving engine.