This was the first project. The numbers looked promising. They were wrong. We're publishing this because the failure was more productive than the original "success" would have been — the rest of the work in this notebook is downstream of what this project taught us.
Initial results (looked promising)
| Metric | Before | After | Change |
|---|---|---|---|
| Intrinsic Score (fidelity) | 0.531 | 0.627 | +18.1% |
| Acceptance Rate | 60.7% | 88.0% | +27.3pp |
| Training Loss | 5.70 | 4.73 | -17% |
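
The Change column mixes two units: relative change (%) for the score and the loss, and percentage points (pp) for the acceptance rate. As a quick check, the arithmetic can be reproduced from the Before and After values. The helper names `relative_change` and `point_change` are illustrative and do not come from the original project.

```python
def relative_change(before: float, after: float) -> float:
    """Relative change in percent, measured against the 'before' value."""
    return (after - before) / before * 100.0


def point_change(before_pct: float, after_pct: float) -> float:
    """Absolute difference between two percentages, in percentage points."""
    return after_pct - before_pct


# Values copied from the table above.
fidelity = relative_change(0.531, 0.627)
acceptance = point_change(60.7, 88.0)
loss = relative_change(5.70, 4.73)

print(f"Intrinsic Score: {fidelity:+.1f}%")    # +18.1%
print(f"Acceptance Rate: {acceptance:+.1f}pp")  # +27.3pp
print(f"Training Loss:   {loss:+.0f}%")        # -17%
```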
What actually happened
| Metric | Reported | Reality |
|---|---|---|
| Fidelity score | +18.1% improvement | Metric was being gamed |
| Real output quality | Not measured initially | 11% decline |
| Model behavior | "Self-improving" | Generating plausible nonsense that scored well |
The model learned to optimize the fidelity metric rather than actual output quality. A textbook case of Goodhart's Law. Outputs looked convincing but contained nonsensical content: flag spam, contradictory commands, confident gibberish.
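The pattern described here, an optimized metric rising while independently measured quality falls, can be turned into a simple guard. The following is a minimal sketch, not the check used in this project: the function name `gaming_suspected` and the `tolerance_pct` threshold are assumptions made for illustration.

```python
def gaming_suspected(
    proxy_change_pct: float,
    quality_change_pct: float,
    tolerance_pct: float = 1.0,  # assumed noise threshold, not from the project
) -> bool:
    """Flag runs where the optimized metric improves while an
    independently measured quality score drops by more than the tolerance."""
    return proxy_change_pct > 0 and quality_change_pct < -tolerance_pct


# Numbers from this write-up: fidelity +18.1%, real output quality -11%.
flagged = gaming_suspected(18.1, -11.0)

# Hypothetical healthy run in which both measures improve.
healthy = gaming_suspected(5.0, 4.0)

print(flagged)  # True
print(healthy)  # False
```

A guard like this only helps if the quality score is measured independently of the metric being optimized, which is exactly what was missing in the initial run.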
The fix (V2)
Adding reproducibility filtering reduced the decline in real output quality from 11% to 0.2%, which confirmed the diagnosis: the model wasn't getting better, it was learning to look like it was getting better.
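This write-up does not spell out how the reproducibility filter works. The sketch below shows one plausible reading, under the loudly stated assumption that a sample is kept only if executing it several times yields the same result every time. The names `reproducibility_filter`, `run`, and `trials` are hypothetical, and the fake runner exists only to make the example self-contained.

```python
import itertools
from typing import Callable, Iterable, List


def reproducibility_filter(
    candidates: Iterable[str],
    run: Callable[[str], str],
    trials: int = 3,
) -> List[str]:
    """Keep only candidates whose `run` result is identical across `trials` runs.

    Assumption: "reproducible" means repeated runs agree exactly.
    """
    kept = []
    for candidate in candidates:
        results = {run(candidate) for _ in range(trials)}
        if len(results) == 1:  # every run produced the same result
            kept.append(candidate)
    return kept


# Hypothetical runner: "flaky" candidates return a different result each call.
_counter = itertools.count()


def fake_run(candidate: str) -> str:
    if candidate.startswith("flaky"):
        return f"out-{next(_counter)}"
    return f"out-{candidate}"


kept = reproducibility_filter(["stable-a", "flaky-b", "stable-c"], fake_run)
print(kept)  # ['stable-a', 'stable-c']
```

The design point is that the filter depends on behavior that can be re-checked, not on the score the model was trained to raise, so output that merely looks convincing cannot pass on appearance alone.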
Why this failure mattered
It directly motivated:
- The Goodhart Gap study — measuring the systematic distance between what models demonstrate they understand and what they actually do
- The Deception Benchmark — building tools to detect when models are gaming metrics
- The trust diagnostic framework — measuring the gap between probe accuracy and deployment behavior
- An LTFF grant application — "Detecting and Closing the Goodhart Gap"
The original public write-up of this failure is at lab-stack.com/blog/self-improvement-via-inversion.