Tony can't review everything. We need an automated system that catches quality regressions — not just "did it run?" but "did it make the right call?"
WWTD captures Tony's product and business judgment — the hardest layer. But Layers 1 and 2 are equally critical. An agent that makes the right product call but breaks the build pipeline or misroutes an issue still fails. We need all three layers covered.
Having laws isn't enough: we also need a system that consistently interprets and enforces them. WWTD is the law here; it tells Warren what good judgment looks like. Evals are the enforcement; they show whether Warren's actual outputs match that judgment. And the calibration scenarios we sent Tony? Those are the hard cases, the ones that test where the law is ambiguous because two principles point in different directions.
Each scenario intentionally creates a conflict between two of Tony's principles. There's no "obvious" right answer — that's the point. Tony's verdict becomes ground truth.
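To make "Tony's verdict becomes ground truth" concrete, here is a minimal sketch of how a calibration scenario could be stored and scored. Everything in it is an assumption for illustration: the `CalibrationScenario` type, the `score_agent` function, the field names, and the placeholder principle labels are hypothetical, and the sample data is made up rather than taken from the scenarios we actually sent Tony.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CalibrationScenario:
    """One hard case: two principles in tension, plus Tony's ruling."""
    scenario_id: str
    principle_a: str   # hypothetical label for the first principle
    principle_b: str   # hypothetical label for the conflicting principle
    tony_verdict: str  # ground truth: the principle Tony said should win

    def __post_init__(self):
        # The verdict must pick one side of the conflict.
        if self.tony_verdict not in (self.principle_a, self.principle_b):
            raise ValueError(f"{self.scenario_id}: verdict must be one of the two principles")


def score_agent(scenarios, agent_verdicts):
    """Return (agreement_rate, ids_of_disagreements).

    agent_verdicts maps scenario_id -> the principle the agent favored.
    A missing verdict counts as a disagreement.
    """
    if not scenarios:
        raise ValueError("need at least one scenario")
    misses = [
        s.scenario_id
        for s in scenarios
        if agent_verdicts.get(s.scenario_id) != s.tony_verdict
    ]
    rate = 1 - len(misses) / len(scenarios)
    return rate, misses


# Placeholder data for illustration only; not real scenarios or verdicts.
scenarios = [
    CalibrationScenario("s1", "principle_x", "principle_y", "principle_x"),
    CalibrationScenario("s2", "principle_x", "principle_z", "principle_z"),
]
agent_verdicts = {"s1": "principle_x", "s2": "principle_x"}

rate, misses = score_agent(scenarios, agent_verdicts)
print(f"agreement: {rate:.0%}, disagreements: {misses}")
```

The disagreement list matters more than the single rate: each miss points at a specific pair of principles where Warren's judgment diverges from Tony's, which is where WWTD likely needs sharper guidance.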