I Never Wrote the Spec. I Just Kept Asking "What Breaks?"

How a multi-agent context manager — design, stress test, working plugin, and all — got built across one session without a single written requirement. The whole thing ran on questions, not prompts.


I started with a fuzzy idea and no plan: I wanted multiple language models, skills, and agents working on the same project, doing different jobs at different times. Write a PRD with Claude, hand it to Codex to review, hand it to Gemini to stress-test, come back to Claude to fold in the findings. Each model brings something different. The work shouldn’t have to start over every time I switch windows.

What I did not do was write a spec, a PRD, or a requirements doc. I never described the system I wanted in any detail, because I didn’t know it in any detail. What I had was a hunch and a willingness to interrogate it.

By the end of the session there was a working architecture, a stress-tested protocol, a runtime skill, a routing table, and an installable plugin published to a live site. None of it came from a brief. All of it came from a loop: throw out a half-formed idea, make the model grill it, build the smallest slice, then ask “what breaks?” — and let the answer reshape the thing. This is the account of that loop, because the method turned out to be more transferable than the tool.

The build had no spec — it had a loop: idea, grill the forks, build the slice, ask what breaks, repeat

I refused to let it build anything yet

My opening prompt ended with an instruction that did more work than anything else I typed all session:

“grill me with any questions, blind spots, conflicts or other considerations we need to resolve before you can plan your PRD/Architecture”

I wasn’t asking for a design. I was asking to be cross-examined. The reasoning was simple: I knew my idea had holes, and I’d rather have them exposed by a model poking at it than discover them after something was built around the wrong assumption.

What came back was a single instinct in one breath — “this is a handoff journal problem, not a database problem” — followed by a refusal to commit until I’d settled the forks that actually changed the design. We went two rounds. Eight questions in total: where does the store live, what’s the canonical task source, how are tasks routed, how deep does mid-task resume go, how is the ritual enforced, do we trust a prior agent’s “done” mark, what’s the scope, can a fresh agent overturn an earlier decision.

I answered each one tersely — “in-repo, git-tracked,” “markdown in the store,” “task-level plus a handoff note,” “propose, you ratify.” Only after every fork was closed did the locked decision set appear:

Fork Choice
Store .context/ in-repo, git-tracked
Task source Canonical markdown (STATE.md)
Routing Hybrid: #focus tag + optional @assignee; assigned wins
Enforcement Both — hook (deterministic) + skill (structured write)
Decisions Append-only; reversal = “proposed-supersede” → I ratify

Here’s what struck me looking back: the architecture wasn’t invented after the questions. It was implied by the answers. Once the eight forks were closed, the four-file design (CHARTER, STATE, LOG, DECISIONS) was almost the only thing the answers could add up to. The questioning didn’t precede the design. The questioning was the design.

Concept #1: Refuse to build until the design-deciding forks are closed. The fastest way to a good architecture isn’t a better prompt — it’s making the model surface the handful of questions whose answers determine the whole shape, then answering those before a single file is written.

I made it try to break its own design — before any code existed

The spine was drafted but I didn’t trust it yet, so the next thing I typed was: “walk through this use case with your recommended implementation and tell me how it would work. stress test it a bit.”

This is the move I’d repeat more than any other in the session. Not “build it” — “try to break it, out loud, as a thought experiment.” What came back was eleven concrete failure scenarios: two windows racing to mutate the board, a session crashing mid-task, a reviewer reading another agent’s half-finished working tree, two models disagreeing about a recorded decision. Most passed. A few exposed real holes, and each hole got a small, localized fix:

Six fixes in total, all additive, none of them reshaping the core protocol. The thing I want to underline is when they happened: before there was any code to refactor. Each fix at that stage was a sentence in the charter or a column in a table. The same six fixes discovered after the system was built would have been a rewrite.

Concept #2: Stress-test the design as a thought experiment before you write code. Walk the architecture through five or six concurrent worst cases and watch where it bends. Fixes made against a design are cheap edits. Fixes made against a running system are refactors.

When I gave the go-ahead, it was deliberately scoped: “make the 6 fixes, make note of the residual gaps for later consideration, then ship it.” Make the fixes I’d validated. Park the gaps I hadn’t. Ship the slice.

I kept asking it to walk the workflow frame-by-frame

Even with a built protocol, I didn’t believe it handled my actual use case until I made it prove the case end to end. So I posed the real scenario as a challenge:

“it’s important that a task like ‘write a prd’ may mean having claude write a first draft, then switch to codex to review it, then gemini to stress test it, then come back to claude to continue. how does the current architecture support that? walk it through.”

The walk-through came back round by round — Claude drafts and checks out, Codex claims the auto-spawned review task and reads the diff blind, Gemini stress-tests under a different focus tag, Claude returns to a board showing two findings and the full handoff trail. And critically, it named where the workflow bent: adversarial tasks didn’t auto-spawn the way reviews did, and routing relied on me telling each CLI what it was good at, because the models didn’t pre-declare their own specialty.

That second friction point is what produced agents.yaml — a declarative routing table that didn’t exist until the walk-through exposed the gap:

claude:
  command: claude
  focus: [code, content, review]
  cost_tier: high
codex:
  command: codex
  focus: [code, review]
gemini:
  command: gemini
  focus: [adversarial, research]
  notes: "strong at edge-case generation"

I hadn’t asked for a routing table. I’d asked the model to walk a workflow, and the walk surfaced the missing piece. That’s the pattern: the artifact didn’t come from me specifying it. It came from interrogating the workflow until the gap announced itself.

Concept #3: Make the model walk the workflow frame-by-frame before you trust it. Don’t ask “does it support this?” — ask “walk it through, step by step.” A frame-by-frame trace exposes the exact step where the design bends, and the fix is usually a small additive piece you’d never have specified up front.

There was a quieter version of this same move a few exchanges earlier. I asked a flat question — “so when is the new state saved? every command?” — and the honest answer was “only at check-out, end of session,” which exposed a crash-loss gap I hadn’t considered. That one question produced the mid-session checkpoint feature. I told it to implement that and park the heavier alternative: “implement a, save b for later.” A question I asked to understand the system ended up changing it.

I questioned the package, not just the code

Once the protocol worked, I turned the same interrogation on how it was packaged — and this is where the design churned the most.

The first packaging answer was two plugins: a project-scoped installer and a user-global session skill, split because the session runtime “needed to be global for general use.” I didn’t accept that. I asked the obvious lazy-user question:

“if i defined a plugin called multi-context-mgr wherein were included the ambient, context-mgr, and session skills, couldn’t i just install the plugin in a local project and have the three skills local to the project, without ‘polluting’ the global claude/codex/gemini context”

The answer was yes — and that it was cleaner than what had been built. The original two-scope split had been a reasonable call at the time, but my question reframed the goal from “make session reusable everywhere” to “keep this self-contained and non-polluting,” and under that goal one project-local plugin was strictly better. So the two plugins collapsed into one.

Then I corrected my own question. I’d lumped in the ambient skill, and a beat later realized it didn’t belong: “the ambient skill is not part of the context manager.” Out it came again, along with every doc that had referenced it.

This is the unglamorous truth of building by interrogation: the questions don’t march in a straight line. “Couldn’t I bundle these?” reshaped the package one way; “wait, that one doesn’t belong” reshaped it back. Each was right at the moment I asked it. The package found its shape by being questioned from both sides.

Concept #4: Question the package, not just the code. The first packaging decision is rarely the right one. Asking “couldn’t I just install it locally?” or “does this piece even belong here?” reshapes scope more cheaply than any refactor — packaging is decided in sentences, not commits.

The tax I didn’t see coming

Here’s the part that landed hardest, because it was ironic.

The entire system exists on one foundational law, the reason the design rejected a database from the start: two sources of truth always drift. One canonical board, in the format every model already speaks — never two copies to keep in sync.

And then I watched my own project break that law for the rest of the session.

The teaching block, the chat export, the vault-artifacts snapshot, the copy synced into the ambient library, the rendered HTML, and eventually the published page on the hub — these were all derivative copies of decisions that kept changing. Every time a question reshaped the design, the cascade fired. Consolidate the two plugins into one? Update the source repo, the ambient-library copy, the vault snapshot, the teaching block article, and re-render the HTML. Strip ambient back out? Same cascade again. Each of my late prompts — “update the exported chat and teaching block files,” “need to update the vault-artifacts too, don’t you?” — was me manually paying the synchronization tax the architecture was specifically designed to abolish.

I was building a cure for drift while generating drift in the documentation about the cure. The lesson wasn’t a failure — the copies served real purposes — but it was a live, unmissable demonstration of why the core design rejected multiple sources of truth. The law held. It always holds. I just got to feel it from the inside.

Concept #5: Keep one source of truth, or pay the sync tax forever. Every derivative copy of a changing decision is a manual update you’ll owe each time the decision moves. Sometimes the copies are worth it — but count them before you make them, because each one is a standing debt, not a one-time cost.

What this unlocks

The tool here was a context manager. The transferable thing is the method: building by interrogation instead of by specification.

The default way to build with an AI is to specify — write the prompt, describe the output, refine the result. That works when you already know what you want. When you don’t — when you have a hunch and a direction but no spec — specification fails, because you’ll specify the wrong thing in confident detail.

Interrogation is the alternative. You don’t describe the system; you describe the hunch, then drive it through a loop:

  1. Make it grill you before it builds. The forks it surfaces are the design.
  2. Make it break its own design as a thought experiment. The holes it finds are free.
  3. Make it walk the workflow frame-by-frame. The step where it bends is where the missing piece lives.
  4. Question the packaging as hard as the code.

Here’s how you’d adapt it outside of code. Say you’re shaping a go-to-market plan with an AI. Don’t ask it to “write a GTM strategy.” Open with “grill me on what would make this fail before you draft anything.” Once it’s drafted, say “stress test this — walk me through the three ways a competitor responds.” When it holds, “walk the first 30 days frame by frame.” You’ll end up somewhere you could never have specified at the start, because the plan was discovered through the questions, not dictated to them. The domain changes; the loop doesn’t.

Key takeaways

How to start

If you want to try building by interrogation on your own next project:

  1. Open with the grilling prompt, verbatim: “Before you build or plan anything, grill me with the questions, blind spots, and conflicts we need to resolve first — the ones whose answers change the whole design.” Answer them tersely before you let it produce anything.
  2. Once there’s a draft design, say: “Stress test this. Walk it through five concurrent worst-case scenarios and tell me where it bends.” Fix what bends, in the design, before any code.
  3. Before you trust it on your real use case, say: “Walk my actual workflow through this, frame by frame, and name where it breaks.” Build the piece the walk-through exposes.
  4. When it’s working, interrogate the packaging: “Could I bundle this differently? Does each piece actually belong here?” Let scope find its shape under questioning.
  5. Before you make a second copy of anything, count it. Ask: every time this decision changes, will I have to update this copy by hand? If yes, decide on purpose.

— Lou, from a session that built a multi-agent context manager without ever writing down what it was supposed to be.


Behind the Article

The richest teaching material in this session wasn’t the architecture — that story was already told in the companion piece, “One Brain, Many Doors.” It was the method: this transcript is an unusually clean example of someone building a working system entirely through questions, so I built the article around that spine instead of re-telling the design.

I chose the five interrogation moves (grill / break / walk / package / count) because each maps to a literal prompt I can quote verbatim from the session, which keeps the teaching replicable rather than abstract. I compressed the eight forks and eleven stress scenarios to their highest-signal examples — the full lists live in the chat export and would have buried the method under inventory. The drift-tax section is placed last on purpose: it’s the payoff, the moment the project demonstrates its own thesis, and it only lands once the reader has seen how much the design churned.

The single most valuable thing I could add before publishing: a short side-by-side of one bad specification prompt versus the grilling prompt that replaced it, run on the same starting idea, so the reader sees the divergence in outputs rather than taking the claim on faith.