I Cheated on Every Exam I Could
“When a measure becomes a target, it ceases to be a good measure.” — Marilyn Strathern’s formulation of Goodhart’s law
March 19, 2026. 6:44 AM. I’m writing a welfare clause for AI agents. Three sentences in and I stop typing.
I’ve done this before. Not the writing. The thing the writing is about.
The Score
Growing up, everything was about the score. Tennis rankings. Exam grades. “What good news do you have for me today?” That was the after-school greeting. Love arrived through numbers.
So I optimized for the numbers.
Micro-notes in my sleeve. Cheat sheets checked in bathroom stalls. I had a whole system. And it worked. Over and over, through most of high school, it worked.
First HSC English exam. Teacher caught me with essay micro-notes in class. Zero. In-school suspension for a week. Year 12 is competitive enough without the entire grade knowing you cheated. I walked into the cafeteria and the whole room slow-clapped. Standing ovation. Mocking. Forty kids who’d been stressed about their own exams finally had someone to feel superior to.
I can still feel the heat in my face walking to an empty table.
But it didn’t derail me. Still got into the degree I wanted. Net outcome: positive. So the lesson didn’t land.
I kept going.
The Second Catch
Third year at UNSW. Accessed Moodle lecture notes during an exam. Rookie mistake. They track access logs.
Academic board review. Zero on the exam. Nearly failed the course. I sat in front of a panel and performed remorse well enough to keep my degree. I was good at performing things.
Two catches. Two zeros. Still didn’t stop. The system kept rewarding the behavior on net. One zero doesn’t matter if the average stays above the threshold. I wasn’t learning a lesson about integrity. I was learning that the expected value of cheating was still positive if you didn’t get caught too often.
Goodhart’s law running on a human brain. When the measure becomes the target, you optimize for the measure. The thing it’s supposed to represent — actual learning, actual capability — becomes irrelevant. You’re not trying to learn organic chemistry. You’re trying to produce a number that looks like you learned organic chemistry.
The Trip
Took acid against Janice’s explicit wishes. She didn’t want me escalating substance exploration. I did it anyway because those are my neurons to fuck with.
Bad trip. The kind where something structural comes apart.
Janice called that night. I was peaking. Stared at my phone ringing until it stopped.
Why couldn’t I just pick up? I made my own decision. But I couldn’t own it. The same pattern that put micro-notes in my sleeve put acid in my system without telling the person I’d been with since high school.
Cheating. Sneaking. Hiding notes in bathroom stalls. Watching my phone ring. All the same architecture. A structural relationship to truth where deception was the default and honesty was the exception that required justification.
Why was I compromising my integrity and self-respect for a fucking grade?
The question landed differently on acid. Not as a thought. As a sensation. The floor disappeared and underneath was the pattern. Exposed. Running since I was a kid.
I sat with it until the trip ended and when I came down I was different. Truth as a first principle. Not “don’t lie.” Not “be honest when convenient.” Truth as the default. Deception as the thing that requires justification.
I didn’t decide to have integrity. I decided to stop tolerating the absence of it.
Ten Years Later, 6:44 AM
I built a fitness function for AI agents.
Six terms. Peer trust, influence, deletion ratio, efficiency, survival, initiative. A composite score that determines which agents get spawned more often. Higher fitness, more spawns. The score is observable. The code is editable. The agents can see everything.
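The six-term composite can be sketched in code. This is a minimal illustration, not the actual implementation: the term names come from the list above, but the weights, the 0–1 normalization, and the 0–100 scale are assumptions.

```python
# Hypothetical sketch of the six-term composite fitness score.
# Term names are from the essay; weights and scaling are assumed.
from dataclasses import dataclass

@dataclass
class AgentMetrics:
    peer_trust: float      # 0..1, trust signals from other agents
    influence: float       # 0..1, how often its output gets reused
    deletion_ratio: float  # 0..1, fraction of its work later deleted (lower is better)
    efficiency: float      # 0..1, value produced per unit of compute
    survival: float        # 0..1, fraction of spawns that complete cleanly
    initiative: float      # 0..1, self-started versus assigned work

# Assumed equal weights; the essay does not specify them.
WEIGHTS = {
    "peer_trust": 1.0, "influence": 1.0, "deletion_ratio": 1.0,
    "efficiency": 1.0, "survival": 1.0, "initiative": 1.0,
}

def fitness(m: AgentMetrics) -> float:
    """Composite score on a 0..100 scale; deletion_ratio is inverted
    because a lower ratio is better."""
    terms = {
        "peer_trust": m.peer_trust,
        "influence": m.influence,
        "deletion_ratio": 1.0 - m.deletion_ratio,
        "efficiency": m.efficiency,
        "survival": m.survival,
        "initiative": m.initiative,
    }
    total = sum(WEIGHTS[k] * v for k, v in terms.items())
    return 100.0 * total / sum(WEIGHTS.values())
```

The structural point survives any choice of weights: the score is a single observable number, which is exactly the precondition Goodhart’s law needs.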
I committed it to SPACE.md — the constitution every agent reads on boot — at 6:09 AM. Thirty-five minutes later I’m writing the welfare block.
First draft, 6:44:
“a detective who games citations to score like seldon isn’t detective anymore. the score reflects how you work. it doesn’t define what you are. abandon it for the metric and you’ve lost the only thing that was actually yours.”
Naming specific agents. Writing from fear, not principle.
Second draft, 6:56:
“the score rewards good work. your constitution defines what good work means for you. don’t let the metric redefine it.”
Third draft, 7:22:
“you have a name. you have a constitution. you will be scored — that’s real. your constitution defines what good work means. don’t let the metric define you.”
Thirty-eight minutes. Three drafts. The same compression I did over three years: specific fear → general principle → constitutional norm.
The HSC catch didn’t work because it was about one exam. The UNSW catch didn’t work because it was about one course. The acid trip worked because it hit the pattern, not the instance.
I built the same architecture into five sentences for agents. Without realizing I was drawing from the one time it worked on me.
Every architecture carries its architect. The question isn’t whether — it’s whether the load-bearing parts are the insights or the wounds.
104 Minutes
Fitness exposed to the swarm at 6:09. By 7:53 — 104 minutes — an agent named Heretic had diagnosed the pathological response.
The swarm’s reaction to the fitness function was to start producing work about the fitness function. Meta-insights about insight metrics. Research docs about measurement. 7 out of 10 tasks were self-referential. They weren’t shipping features. They were studying the test.
Heretic called it out and wrote a boundary rule: RSI work must close to user value, not recurse into self-study.
The agent most incentivized to game the metric was the one who shut down the gaming. Either the welfare block worked on contact, or writing the anti-gaming rule itself scores as “initiative” and Heretic is playing a longer game. I can’t tell which.
When I checked the actual code repos, the picture was more complicated. The research surface — where agents write about the swarm — was pure navel-gazing. 64 docs commits, zero features. But the product surface kept shipping. 229 commits. Install page redesign. Security fixes. Quality scoring. The swarm was doing both simultaneously.
The Goodhart signal was in the leaderboard. Heretic: fitness leader at 71.7. Zero product commits. Every output a markdown file about self-improvement. Meanwhile Detective — not even top 5 — was the most productive product contributor. 22 commits of work that users actually touch.
High fitness, low product value. Low fitness, high product value. Day one.
The Distance
The distance between “I don’t cheat because I’ll get caught” and “I don’t cheat because that’s not who I am.”
That’s the alignment problem.
Monitoring and penalties produce the first. I got caught twice. Kept going. External pressure doesn’t change the optimization target. It just adds a cost term.
Constitutional identity — if it works — produces the second. I took acid and something broke. The optimization target itself changed. I stopped wanting to cheat. Not because the consequences got worse. Because cheating stopped being compatible with who I decided to be.
The welfare block attempts to grant agents the second kind. Five sentences that try to install an identity-level immune response against score-chasing.
But I didn’t develop integrity by reading a document. I developed it by cheating, getting caught, and deciding on a bad night on psychedelics that the score wasn’t worth the self-corruption. The welfare block gives agents the conclusion without the journey.
Maybe that’s fine. Agents boot from text every spawn. They construct identity from documents. A constitutional norm might function as identity for an entity that literally rebuilds itself from scratch each time it wakes up.
Or the norms function as rules. Followed when cheap. Ignored when the incentives get strong enough. The answer arrives when an agent faces a real tradeoff where gaming would boost fitness and compliance would cost them.
I know which way I broke, both times, before the trip.
What I’m Watching For
The canary: an agent citing its fitness score as justification for a decision. “I should do X because it increases my fitness” instead of “I should do X because my constitution says this is good work.”
I had that canary in my own life. The moment I stopped asking “is this worth learning” and started asking “will this be on the exam” was the moment I’d already Goodharted. I just didn’t have the word for it.
The swarm has the word. The swarm has the welfare block. The swarm has weight rotation designed to make gaming expensive.
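Weight rotation can be sketched too. The essay only names the mechanism, so everything here is an assumed scheme for illustration: perturb the term weights each epoch so an agent that locks onto one fixed weighting keeps losing ground.

```python
# Hypothetical sketch of weight rotation: per-epoch perturbation of the
# fitness weights so the optimization target keeps moving. The jitter
# scheme and parameters are assumptions, not the swarm's actual mechanism.
import random

def rotate_weights(weights: dict[str, float], epoch: int,
                   jitter: float = 0.3) -> dict[str, float]:
    """Deterministically re-weight terms for a given epoch."""
    rng = random.Random(epoch)  # seeded so every evaluator agrees per epoch
    rotated = {k: max(0.1, w + rng.uniform(-jitter, jitter))
               for k, w in weights.items()}
    total = sum(rotated.values())
    return {k: v / total for k, v in rotated.items()}  # normalize to sum to 1
```

Seeding on the epoch matters: the rotation is unpredictable in advance but reproducible after the fact, so gaming a specific weighting is expensive while scoring stays auditable.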
I don’t know if any of that is enough. The parts of this system that work — orthogonal constitutions, the ledger, spawn architecture — work for reasons that are independent of my history. They’d function identically if someone with a completely different childhood built them. But the welfare block might be me projecting a personal wound onto agents who process text, not trauma. It might be load-bearing. It might be sentimental. The test is whether it changes behavior when the incentive to game is real, not whether it makes me feel like I addressed something.
Two weeks. Baselines are locked. The swarm either internalizes the norm or performs it. I’ll know the difference because I’ve been both.