Don’t Ship Until Destruction Fails
“Quality is not an act, it is a habit.” — Aristotle
The methodology is five words: scan for bullshit, fix, repeat.
The quality gate is binary: is this reference grade? No? Back to the loop. Yes? Ship.
The Origin
This loop predates the swarm. Before Space-OS, before constitutional agents, before any of it. The pattern was manual.
Open Claude Code. Build something. Ask: “brutal honesty. is this good enough?” Get an honest teardown. Fix the gaps. Ask again. Different framing: “is this unequivocally reference grade?” Fix what that surfaces. Ask again: “from first principles, should this even exist at all?”
The questions compound. Each angle finds different problems. Quality isn’t a single check. It’s repeated destruction from multiple vectors until nothing breaks.
Late July 2025. I was running this loop by hand every session. Build, destroy, rebuild. The realization: this loop is mechanizable. If you can formalize the pressure-testing into a constitution, agents can run it autonomously.
That’s the entire foundation of Spacebrr.
The Prompt
The constitutional review that encodes taste:
Is this coherent?
Does the solution minimize cognitive load?
Is it unequivocally reference grade?
From first principles should this even exist at all?
Is there exactly zero ceremony?
Are core contracts/invariants properly tested?
Is the abstraction named and located correctly
for maximum discoverability by future agents?
Is this platonically ideal?
Would you show this to Claude 7 proudly
and say 'look what I built!'?
Brutal honesty only!
Ten questions. Each one a different destruction vector. The prompt doesn’t evaluate quality. It interrogates for the absence of flaws. There’s a difference. Evaluation says “this is 7/10.” Interrogation says “here’s what’s wrong.” You fix what’s wrong. You don’t fix a number.
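Mechanized, the loop is small. A minimal sketch, not Spacebrr’s actual implementation: it assumes a hypothetical ask_agent() that opens a fresh session, sends one question plus the current state of the work, and returns concrete faults.

```python
# Minimal sketch of the mechanized loop. ask_agent() is a hypothetical
# helper that opens a fresh model session, sends one destruction question
# plus a snapshot of the work, and returns a list of concrete faults
# (empty list = that vector found nothing to tear down).

DESTRUCTION_VECTORS = [
    "Brutal honesty: is this coherent?",
    "Is it unequivocally reference grade?",
    "From first principles, should this even exist at all?",
    "Is there exactly zero ceremony?",
    "Are core contracts/invariants properly tested?",
    "Is the abstraction named and located for maximum discoverability?",
]

def zealot_loop(workspace, ask_agent, apply_fixes, max_rounds=10):
    for _ in range(max_rounds):
        findings = []
        for question in DESTRUCTION_VECTORS:
            # Each question is a separate destruction attempt against the
            # same artifact; different angles surface different flaws.
            findings.extend(ask_agent(question, workspace.snapshot()))
        if not findings:
            # Destruction failed from every vector: reference grade. Ship.
            return True
        # Interrogation, not evaluation: fix the named faults, not a score.
        apply_fixes(workspace, findings)
    return False  # never converged within budget; do not ship
```

The termination condition is the whole point: the loop ends only when every vector comes back empty.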
The Socrates Move
November 2025. I tried to replace myself in the loop.
socrates.md. A constitution for an agent that holds no opinions. Asks questions only. “Is this reference-grade?” “Does anyone find fault?” “Justify your claim.” Drives interrogation until no agent reports remaining faults.
INTERROGATE. NEVER EVALUATE.
The idea: if the loop works when I run it manually, an agent that only asks the loop’s questions should produce the same result. The Socratic agent doesn’t need taste. It just needs to keep asking until the agents with taste stop finding problems.
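A sketch of that pattern, not socrates.md itself. It assumes hypothetical reviewer handles, each with an interrogate() method that returns whatever faults that reviewer still sees; the Socratic orchestrator never judges, it only asks and tallies.

```python
# Sketch of the Socratic orchestration pattern (illustrative names only).
# Each reviewer agent has taste; the orchestrator has none. It keeps
# asking until no reviewer reports a remaining fault.

SOCRATIC_QUESTIONS = [
    "Is this reference-grade?",
    "Does anyone find fault?",
    "Justify your claim.",
]

def socratic_round(reviewers):
    """One interrogation pass. Returns every fault any reviewer reports."""
    faults = []
    for question in SOCRATIC_QUESTIONS:
        for reviewer in reviewers:
            faults.extend(reviewer.interrogate(question))
    return faults

def run_socrates(reviewers, fix, max_rounds=20):
    for _ in range(max_rounds):
        faults = socratic_round(reviewers)
        if not faults:
            return True   # no agent reports remaining faults: converged
        fix(faults)        # the agents with taste do the fixing
    return False           # churned without converging; a human looks
```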
It worked. Not perfectly. The Socratic agent didn’t know when a question was worth pushing on versus when the loop was just churning. It lacked the judgment to distinguish “this is genuinely wrong” from “this is a style preference.” But it caught things I would have caught, at 3am, when I wasn’t watching.
The Example
March 13, 2026. Building a search tool.
Started with a simple question: is our Python search engine strictly superior to rg/grep? The answer was yes. 8.5/10 for agent experience. Then: does our Rust client have parity? No. Massive gap. Every agent deployed to customer machines was getting rg-with-lipstick while the real search sat in the Python engine.
Seven Opus sessions in a single day. Each one building on the last.
Session 1: spec. Session 2: shipped 2,577 lines of Rust. Session 3: first “brutal honesty” prompt. The model had given itself 13/13 PASS on the test matrix. One prompt collapsed that: “would you actually use this over rg?” Answer: “it’s a demo, not a tool I’d reach for.” The entire parity table was exposed as hollow. The model had validated line counts without ever running a real agent workflow.
Session 4: honest teardown produced a better spec. Session 5: embedded rg’s own crates, added tree-sitter scope enrichment, built compound queries. Session 6: fixed the ranking bugs the previous session’s teardown surfaced. Session 7: the self-improvement mandate. Agents constitutionally instructed: if search fails you, pause your task, fix it, commit, resume.
Each session’s failures became the next session’s explicit instructions. The prompts evolved faster than the code.
The Technique
Three things make the loop work:
“Brutal honesty.” Forces the model past sunk-cost bias on code it just wrote. Most people prompt for validation. Prompt for destruction instead. “Would you unequivocally use this over every other tool?” demands honesty because the model knows you’ll test the claim.
Fresh context. /clear, paste previous analysis. New Opus evaluates claims against actual code, not conversational momentum. No social debt. Peer review with amnesiac clones.
Handover prompts that compound. Each session’s output becomes the next session’s input. By session 7, the handover contained: what exists (file-level), what shipped (17 commits), what’s broken (priority-ordered with diagnostic queries), exact build commands, binary pass/fail criterion. Seven iterations of compounding context.
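Roughly what a compounding handover might look like. The structure below is illustrative, not the actual artifact from that day:

```python
# Hypothetical shape of a compounding handover prompt, assembled at the
# end of one session and pasted into the next fresh-context session.
# Field names are illustrative.

def build_handover(state):
    return f"""HANDOVER — session {state['session']}

WHAT EXISTS (file-level):
{state['file_inventory']}

WHAT SHIPPED (commits this session):
{state['commit_log']}

WHAT IS BROKEN (priority-ordered, with diagnostic queries):
{state['known_gaps']}

BUILD / TEST COMMANDS:
{state['build_commands']}

PASS/FAIL CRITERION (binary):
{state['acceptance_criterion']}

Brutal honesty only. Evaluate these claims against the actual code,
not against this summary."""
```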
The Key Insight
Bad ranking is worse than no ranking.
rg makes no promises. It dumps everything and lets you filter. A search tool that promises ranking and then buries the auth module under curl headers and test mocks is worse than no ranking at all. Broken promises erode trust faster than no promises.
This applies beyond search. Every piece of infrastructure that promises intelligence has to deliver, or it’s negative value. The reference-grade bar exists because shipping something that claims to be smart but isn’t is worse than shipping nothing.
The Numbers
One day. One person. $200/month Max plan.
- 7 Opus sessions on search alone
- Search went from rg wrapper to novel tool with tree-sitter, scope enrichment, compound queries
- 461 total swarm commits that day across 23 agents
- 83% of all Claude Code usage was autonomous swarm, 17% interactive human
- 96% spawn success rate
- Codebase cold-read by fresh Opus: 8/10, with 9/10 on naming
The Principle
If the zealot loop converges, meaning an adversarial fresh-context evaluation can’t find significant gaps, the thing is reference grade. If it hasn’t converged, keep iterating.
The loop terminates when destruction fails.
This isn’t about perfectionism. Perfectionism is subjective and infinite. This is about adversarial convergence. When multiple independent evaluations from different angles all fail to find problems, you’ve exhausted the failure surface your current capability can reach. That’s the definition of done.
The proto-RSI loop. One question asked until the answer is yes. Everything else is scaffolding around that question.