A meta-analysis of the reasoning pattern that ran through one refactor session — and how to turn it into a thinking skill you can run on purpose.
I spent a session refactoring an AI quality-control system from one platform onto a different architecture. At the end, I was asked to do something unusual: stop looking at what got built and look at how the decisions got made. Trace the reasoning, not the result.
When I went back through the actual forks in the conversation, a pattern showed up that I hadn’t named while I was inside it. At nearly every decision point, the move was the same shape: the right call was to build less. Delegate it, decompose it, defer it, or delete it — but don’t add the thing my first instinct wanted to add.
That’s worth pulling apart, because “build less” is easy to say and hard to do under pressure. What made it repeatable here wasn’t discipline. It was a set of six specific questions that each had the power to remove work before any code got written. This piece walks the actual moments those questions fired, then formalizes them into a framework — and finally into a skill you could compile and run on your own architecture decisions.
Here’s the map. Six moves. Each one is a question I asked at a fork, and each one made the system smaller.
The system I was porting (an “eval loop” — a quality gate that scores AI output before it ships) was originally built on a platform that provided a lot of machinery: scheduling, approval buttons, multi-agent task boards, notifications, persistent memory. My first instinct was to find replacements for each one. Rebuild the scheduler. Rebuild the approval flow. Rebuild the notifier.
I stopped and asked a different question: what does the environment this will actually run in already provide?
The target architecture runs inside a code harness (a CLI agent). That harness already owns roughly twelve of the fifteen jobs a production agent platform has to do — turn-taking, permissions, scheduling, sub-agents, notifications. So the answer to “how do I rebuild the approval button” was: I don’t. I declare that an approval is needed, and I let the host enforce it.
The line I wrote into the architecture doc was blunt:
Declare governance in the contract; delegate enforcement to the host. Do not build either.
That single reframe deleted four items off the roadmap — policy engine, approval UI, telemetry, scheduler. None of them got built, because none of them were mine to build.
Concept #1: Ask what the host already does before you build anything. Most of the infrastructure you’re about to write is already running underneath you. Your job is to declare intent and bind to it, not to reimplement it.
Later, a real design question came up: where should the scoring standards live, and how do you handle many kinds of writing across many platforms? Sales copy, thought leadership, how-tos, listicles — each on LinkedIn, X, Facebook, and more.
The naive shape is a grid. Eight genres times six platforms is forty-eight rubric files, each a slightly-different monolith, all drifting apart over time. My first sketch was heading straight for that grid.
The question that broke it: are these two things actually the same kind of thing, or are they independent? They’re independent. Genre is what kind of writing it is — the substance. Platform is where it ships — the constraints. They’re orthogonal. Sales copy is sales copy whether it’s on LinkedIn or X; only the length and format conventions change.
Once they’re orthogonal, you don’t multiply them. You compose them:
resolved_rubric = genre_base ⊕ platform_overlay
Eight genres plus six platforms is fourteen small units, composed at the moment of use, instead of forty-eight monoliths maintained by hand.
The same move showed up one level down, inside a single rubric: I separated the judge (the generic scoring engine), the rubric (the named standard), and the gold standard (the example data) into three roles instead of fusing them into one file. Same instinct — find the seams that are actually independent, and don’t glue them together.
Concept #2: Find the orthogonal axes and compose them — don’t multiply them. When two variables are independent, a matrix is a trap. Composition turns an N×M maintenance problem into an N+M one.
The gate has to score output. Some of that scoring is checkable — does this JSON parse, does this match the expected label, does this regex hit. Some of it is judgment — is this writing actually good.
The tempting shortcut is to let the language model do all of it, because it can. But there’s a cost asymmetry I’ve learned to respect: code is cheap to write, cheap to test, and cheap to delete. Inference compounds — every call costs money and varies run to run, forever.
So I classified each scoring step before building it. The deterministic checks became a plain Python tool with no dependencies — exact-match, regex, a JSON validator, a similarity score. The genuinely open-ended judgment became an inference call, isolated behind a clean interface. The judge never grew a branch for “if the answer is checkable” — those never reach it.
Concept #3: Write the deterministic part as code; spend inference only where judgment is required. Code is a one-time cost you can delete. Inference is a recurring cost that compounds across every run. Classify the step before you reach for the model.
With genres, the easy path was to build every rubric as a full-featured scoring program up front. Instead I set a rule: a genre starts as the simplest possible thing — a plain reference document the generic judge reads. It only graduates to having its own scoring code when the generic judge provably can’t handle it.
Three genres earned the upgrade, and only because each needed real structural logic the judge scores inconsistently: a how-to needs its steps checked for dependency order, a listicle needs each item to earn its place, sales copy needs exactly one call-to-action and real proof. The other three stayed as simple documents. No code written for a need that hadn’t shown up yet.
Concept #4: Make every layer earn its place — add structure when it’s needed, not before. Pre-building for needs that haven’t arrived is the most expensive kind of guessing. Start with the cheapest form and let real friction pull you up the ladder.
This is the move that looks least like architecture and mattered most. Every scoring tool got smoke-tested against a known-good and a known-bad example the instant it existed — not at the end, not “later.”
That immediacy caught three real bugs on the spot. The listicle scorer gave a good tight list only 0.5, because it measured the body under each heading and the items were single lines — so I fixed it to measure the whole item span. The how-to scorer failed on the word “prerequisites” because a regex word-boundary was too strict for the plural. The sales-copy scorer gave a genuinely good piece a failing 0.4, because its call-to-action was “Get the kit” and my pattern list didn’t include the “get the/your X” form.
None of those would have been visible from reading the code. They were only visible by running it against a case I already knew the answer to.
Concept #5: Test each piece against a known-good and a known-bad the moment you build it. A scorer you haven’t run on something you already understand is a guess wearing a number. Immediate good-vs-bad checks turn invisible logic bugs into visible ones.
Two smaller decisions shared a spine. First: the scoring standards got promoted to a shared location that both the quality gate and any future writing tool resolve by name — so the thing that writes the copy and the thing that grades it aim at the identical standard. One contract, two consumers. The alternative — a copy of the rubric inside the gate — would have let the writer and the grader drift apart, which defeats the whole point.
Second: I marked the places I wasn’t certain instead of papering over them. When I documented the adapters for three different CLI tools, I flagged which specifics I’d verified against current docs and which were high-but-not-confirmed. When a research API ran out of quota, I said so. When the heuristic scorers shipped, I labeled them as cheap structural proxies feeding the judge — not the whole score. And when asked to seed the “gold standard” examples, I refused to invent them, because fabricated exemplars would make the entire standard fiction.
Concept #6: Keep one source of truth, and flag the seams you’re unsure about. Duplication lets two copies of a standard drift; a single named source can’t. And an honestly-marked uncertainty is a feature you can act on — a hidden one is a liability you can’t.
Step back from the specific system and the six moves rhyme. Every one of them is a question that, asked before writing code, can remove work:
| Move | The question | What it removes |
|---|---|---|
| Delegate | What does the host already do? | Infrastructure you’d otherwise rebuild |
| Decompose | What are the orthogonal axes? | The combinatorial matrix |
| Classify | Code or inference? | Recurring cost where code would do |
| Earn | Does this layer have a need yet? | Speculative structure |
| Verify | Good-vs-bad, now? | Invisible bugs that compound later |
| Anchor | One source, honest seams? | Drift and hidden risk |
I’d call this the Subtraction Pass: a pre-build reasoning pass where the default answer to “should I build this?” is “probably not yet,” and each of the six questions is a filter the proposed work has to survive. It’s not anti-building. It’s a discipline for making sure the things you do build are the things that genuinely had to be yours.
Here’s how you’d apply it outside this session. Say you’re a consultant about to build an AI intake system. Delegate: your scheduling tool and CRM already do reminders and forms — don’t rebuild them. Decompose: “industry” and “service tier” are orthogonal; template each axis and compose, don’t write a doc per pair. Classify: routing by keyword is code; summarizing the intake is inference. Earn: start with one generic intake flow, specialize only the industries that actually break it. Verify: run each new template against a real past client and a deliberately messy one. Anchor: keep the question set in one place both the intake bot and your proposal generator read.
This pass is regular enough to compile. If I were authoring it as a skill, it would look like this:
name: subtraction-pass
description: >
Run before building any new system, feature, or architecture. Takes a
proposed build and pushes it through six removal filters — delegate,
decompose, classify, earn, verify, anchor — surfacing what NOT to build
before any code is written. Trigger when the user is about to architect,
scaffold, or "build a system for X," or says "what's the cleanest way to
build," "am I overbuilding," or "design the architecture for."The body is the six filters as ordered gates, each with a question, a default of “don’t build,” and a one-line justification the proposal has to clear. The high-leverage design choice — straight out of this session — is that the deterministic checks in the pass (does the host expose this primitive? are these axes independent? is this output checkable?) get written as prompts the model answers against the actual environment, while the pass itself stays a fixed checklist. Same lesson as Move 3: the framework is the cheap, stable code; the judgment is the part you spend intelligence on.
It would live in a shared location both a planning agent and a building agent resolve by name — so the thing that designs and the thing that implements run the same pass. Same lesson as Move 6.
— Lou, looking over the shoulder of a refactor that kept choosing to do less.
The richest teaching material was in the moments the reasoning changed shape — the 48-file grid collapsing to 14, the four roadmap items getting deleted by a single reframe. Those carry more than the parts that just went smoothly, so the piece weights them. I compressed the actual build sequence hard (the original session was 96 turns of scaffolding and debugging) and reordered the decisions into the six clean moves; in the live session they interleaved and recurred rather than happening once each. Move 5 (verify-now) got promoted up the importance order because the three real bugs it caught are the most concrete, least abstract evidence in the whole piece. The single most valuable thing to add before publishing: a sidebar showing the failed version of one move — a place in some other project where skipping “delegate” or “verify” caused a real mess — so the pass is motivated by a scar, not just a success.