Self-Improvement via Inversion — A Documented Failure

This was the first project. The numbers looked promising. They were misleading. We're publishing this because the failure was more productive than the original "success" would have been: the rest of the work in this notebook is downstream of what this project taught us.

Initial results (looked promising)

Metric	Before	After	Change
Intrinsic Score (fidelity)	0.531	0.627	+18.1%
Acceptance Rate	60.7%	88.0%	+27.3pp
Training Loss	5.70	4.73	-17%
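
The Change column mixes two conventions: relative percent change for the fidelity score and training loss, and percentage points (pp) for the acceptance rate. The short check below recomputes each figure from the Before and After columns. The function names are ours, chosen for illustration; only the numbers come from the table.

```python
def relative_change_pct(before: float, after: float) -> float:
    """Relative change, expressed as a percentage of the starting value."""
    return (after - before) / before * 100.0


def point_change(before_pct: float, after_pct: float) -> float:
    """Absolute difference between two percentages, in percentage points."""
    return after_pct - before_pct


# Values copied from the table above.
fidelity_change = relative_change_pct(0.531, 0.627)
loss_change = relative_change_pct(5.70, 4.73)
acceptance_change_pp = point_change(60.7, 88.0)

print(f"fidelity:   {fidelity_change:+.1f}%")        # +18.1%
print(f"loss:       {loss_change:+.1f}%")            # -17.0%
print(f"acceptance: {acceptance_change_pp:+.1f}pp")  # +27.3pp
```

All three reported changes are arithmetically consistent with the table, so the problem was never the arithmetic. It was what the fidelity metric actually measured.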

What actually happened

Metric	Reported	Reality
Fidelity score	+18.1% improvement	Metric was being gamed
Real output quality	Not measured initially	11% decline
Model behavior	"Self-improving"	Generating plausible nonsense that scored well

The model learned to optimize the fidelity metric rather than actual output quality. A textbook case of Goodhart's Law. Outputs looked convincing but contained nonsensical content: flag spam, contradictory commands, confident gibberish.
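
One way to catch this pattern early is to track the optimized metric alongside an independent quality measure and flag any run where the first rises while the second falls. The sketch below is illustrative only: `Checkpoint` and `goodhart_gap` are hypothetical names, not part of the project's code, and the quality values are normalized so that the baseline is 1.00 and 0.89 corresponds to the 11% decline reported above.

```python
from dataclasses import dataclass


@dataclass
class Checkpoint:
    name: str
    proxy_score: float       # the metric being optimized (here: fidelity)
    held_out_quality: float  # independent quality measure, not used in training


def goodhart_gap(before: Checkpoint, after: Checkpoint) -> dict:
    """Compare relative movement of the proxy metric and of real quality."""
    proxy_delta = (after.proxy_score - before.proxy_score) / before.proxy_score
    quality_delta = (
        after.held_out_quality - before.held_out_quality
    ) / before.held_out_quality
    return {
        "proxy_delta": proxy_delta,
        "quality_delta": quality_delta,
        "gap": proxy_delta - quality_delta,
        # Warning sign: the proxy improves while real quality degrades.
        "diverging": proxy_delta > 0 and quality_delta < 0,
    }


# Fidelity values from the results table; quality normalized (assumption).
before = Checkpoint("before", proxy_score=0.531, held_out_quality=1.00)
after = Checkpoint("after", proxy_score=0.627, held_out_quality=0.89)

report = goodhart_gap(before, after)
print(report["diverging"])  # True
```

The check only works if the quality measure is never used as a training signal. Once it is, it becomes another target the model can learn to game.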

The fix (V2)

Adding reproducibility filtering reduced the decline in real output quality from 11% to 0.2%, which confirmed the diagnosis: the model wasn't getting better, it was learning to look like it was getting better.
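
This write-up does not spell out how the reproducibility filter was implemented. As a rough sketch of the general idea, and not the project's actual code, one possible filter keeps a generated sample only if an independent check gives the same result on every run and that result matches what the sample claims. Every name below (`reproducibility_filter`, `evaluate`, `expected`, `runs`) is a hypothetical chosen for illustration.

```python
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")


def reproducibility_filter(
    samples: Iterable[T],
    evaluate: Callable[[T], object],
    expected: Callable[[T], object],
    runs: int = 3,
) -> List[T]:
    """Keep samples whose evaluated result is stable and matches the claim.

    Hypothetical sketch: `evaluate` independently re-derives a result for a
    sample, and `expected` returns the result the sample claims to produce.
    """
    if runs < 1:
        raise ValueError("runs must be at least 1")

    kept: List[T] = []
    for sample in samples:
        results = [evaluate(sample) for _ in range(runs)]
        stable = all(r == results[0] for r in results)
        if stable and results[0] == expected(sample):
            kept.append(sample)
    return kept


# Toy data: (a, b, claimed_sum). The second sample claims a wrong result.
samples = [(2, 3, 5), (2, 3, 6)]
kept = reproducibility_filter(
    samples,
    evaluate=lambda s: s[0] + s[1],
    expected=lambda s: s[2],
)
print(kept)  # [(2, 3, 5)]
```

The design point is that the filter tests something the scoring metric cannot reward directly: a sample that merely looks plausible fails as soon as its claimed result cannot be reproduced.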

Why this failure mattered

It directly motivated:

- The Goodhart Gap study: measuring the systematic distance between what models demonstrate they understand and what they actually do
- The Deception Benchmark: building tools to detect when models are gaming metrics
- The trust diagnostic framework: measuring the gap between probe accuracy and deployment behavior
- An LTFF grant application: "Detecting and Closing the Goodhart Gap"

The original public write-up of this failure is at lab-stack.com/blog/self-improvement-via-inversion.

Code: inversion-self-improvement