Designing a context manager that lets multiple AI agents — different models, different CLIs, different days — work on the same project without losing the thread.
The store is the truth. Open terminals are just doors into it.
That line came late in the design session, but once it landed, it collapsed three weeks of potential over-engineering into four markdown files and a decision that turned out to be obvious. Here’s how it got there, and why the first instinct pointed the wrong way.
The ambition: write a PRD with Claude, switch to Codex to review it, switch to Gemini to stress-test it, come back to Claude to incorporate the findings. Each model brings something different. The work shouldn’t have to start over each time the window changes.
The first instinct is to ask: how do we make the models talk to each other? That’s the wrong question. And chasing it would have produced something complicated, fragile, and Claude-specific.
If you start from “agents need to share context,” you end up thinking about message buses, real-time streams, MCP plumbing, orchestrator daemons that hold conversation state and replay it into each new model. All of that is real engineering. None of it would have worked.
It wouldn’t have worked because the agents are in different processes, different runtimes, sometimes different sessions hours apart. They have no shared memory. Forcing them to act like they do means building a synchronization layer none of the tools want to support, then watching it break the first time a model is updated, a CLI changes its interface, or a session crashes mid-handoff.
The premise itself was wrong. The agents don’t actually need to talk to each other. They need to leave each other notes.
Concept #1: When state lives outside the actor, the system stops caring who acts. The hardest problem in multi-agent coordination isn’t communication — it’s persistence. Solve persistence with a shared, durable store, and the communication problem dissolves. Whichever agent shows up next reads the store, picks up where the last one stopped, and writes back what it did. The store becomes the protocol.
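The read-then-write-back loop can be sketched in a few lines. This is a minimal illustration, not the plugin's actual code: the function names `check_in` and `check_out`, the `.context/` path argument, and the entry heading format are all assumptions made for the example.

```python
from datetime import datetime, timezone
from pathlib import Path


def check_in(context_dir: Path) -> dict:
    """Read the live board and the handoff journal before doing any work."""
    return {
        "state": (context_dir / "STATE.md").read_text(encoding="utf-8"),
        "log": (context_dir / "LOG.md").read_text(encoding="utf-8"),
    }


def check_out(context_dir: Path, agent: str, summary: str) -> None:
    """Append one handoff entry to LOG.md; earlier entries are never rewritten."""
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%MZ")
    # Hypothetical entry layout: a timestamped heading naming the agent, then the note.
    entry = f"\n## {stamp} ({agent})\n{summary}\n"
    with (context_dir / "LOG.md").open("a", encoding="utf-8") as log:
        log.write(entry)
```

Nothing in either function knows which model is calling it, which is the point: the agent is an argument, the store is the constant.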
The architecture that fell out was almost embarrassingly simple. Four
markdown files in a directory called .context/ at the
project root, plus a tiny lock file:
| File | Purpose |
|---|---|
| CHARTER.md | The static protocol: how this team works (read once per agent) |
| STATE.md | The live board: active objective, task queue, who’s on what |
| LOG.md | Append-only handoff journal: what just happened |
| DECISIONS.md | Append-only ADR-lite: why it’s like this |
| lock | Short-held mutex during STATE/DECISIONS mutations |
That’s the whole brain. No database, no service, no MCP server, no custom protocol. Markdown is agent-native — every model reads and writes it without translation. Git is already there for versioning, attribution, and “what changed.”
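The text only says the lock is a tiny file held briefly during STATE/DECISIONS mutations. One plausible way to get that behavior, assuming exclusive file creation is the mechanism (the source does not say), is sketched below. The name `board_lock` and the timeout parameters are hypothetical.

```python
import os
import time
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def board_lock(context_dir: Path, timeout_s: float = 10.0, poll_s: float = 0.1):
    """Hold .context/lock for the duration of one STATE/DECISIONS mutation."""
    lock_path = context_dir / "lock"
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            # O_EXCL makes creation fail if the file exists, so only one process wins.
            fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            break
        except FileExistsError:
            if time.monotonic() >= deadline:
                raise TimeoutError(f"could not acquire {lock_path}")
            time.sleep(poll_s)
    try:
        # Record the holder's PID so a human can see who has the board.
        os.write(fd, str(os.getpid()).encode())
        yield
    finally:
        os.close(fd)
        lock_path.unlink()
```

Usage would look like `with board_lock(Path(".context")): ...` wrapped around the edit to STATE.md. A crashed holder would leave the file behind, so a real version would likely need some form of stale-lock cleanup.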
The session arrived at this through a deliberate set of forks rather than guessing. Eight architectural questions over two rounds — where does the store live, what’s the canonical task source, how are tasks routed, how deep is mid-task resume, how is the ritual enforced, do we trust prior “done” marks, what’s the scope, can a fresh agent overturn a prior decision? Each fork closed off a wrong path before it could grow.
The forks that mattered most:
The store is the truth. Open terminals are just doors into it. That wasn’t metaphor — it was load-bearing. It meant:
This last point became important later, when the question came up: if I build an orchestrator agent that auto-spawns the CLIs, is that a different system? No. The protocol’s contract was always with the board, not with the human. An orchestrator agent is Lou-as-a-script. The board doesn’t know the difference.
After the spine was drafted, the session ran a deliberate stress test — eleven concrete scenarios where multiple models worked the same project at the same time. Most of the scenarios passed cleanly. A few exposed real holes:
| Scenario | Hole | Fix |
|---|---|---|
| Reviewer reads the WIP working tree of another agent | “Is this final or mid-edit?” ambiguity | Review against commit SHAs, never working trees. Every Ready-for-Review entry records its commit. |
| Doer marks task ready; reviewer has no entry point | Manual queueing every time | Auto-pair review tasks. Check-out spawns a paired T-NNN-R Review @ <SHA> — #review task. |
| A session crashes leaving a task stuck in Doing | Phantom ownership | started: timestamp + 4h staleness check. Surfaces to “Needs Lou” on the next check-in. |
| Reviewer reads the doer’s LOG before the diff | Anchoring on the doer’s framing | Blind-first-read discipline. Read the task definition, then the diff blind, then (only then) the LOG. |
| Multi-model reviews are stylistically inconsistent | Findings noise | A finding template with severity / location / evidence / suggested fix. Codified in CHARTER. |
| Multiple worktrees fork the brain | Silent divergence | .context/ lives in the main repo, symlinked into each worktree. One brain, always. |
The point isn’t that these fixes were clever. The point is that running the architecture as a thought experiment against real concurrent scenarios — before any code — surfaced six real failure modes, each of which got a small, localized, additive fix. None required reshaping the protocol.
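To show how small these fixes are, here is a sketch of the staleness check from the table. It assumes the Doing tasks have already been parsed out of STATE.md into a mapping of task id to `started:` timestamp; that mapping, the function name `surface_stale`, and the strict "older than 4h" comparison are assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone

STALE_AFTER = timedelta(hours=4)  # the 4h window named in the table


def surface_stale(doing: dict[str, datetime], now: datetime) -> list[str]:
    """Return ids of Doing tasks whose `started:` stamp is older than the window."""
    return sorted(
        task_id for task_id, started in doing.items() if now - started > STALE_AFTER
    )


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    board = {"T-001": now - timedelta(hours=5), "T-002": now - timedelta(minutes=30)}
    # Anything listed here would be moved to "Needs Lou" at the next check-in.
    print(surface_stale(board, now))
```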
Concept #2: Stress-test the architecture before you stress-test the code. Sit with the design and walk it through five or six worst-case scenarios out loud. Watch where it bends. The fixes you make at that stage are tiny — a column added to a table, a sentence in the charter. The same fixes after the system is built are a refactor.
A protocol that depends on agents remembering to follow it is a protocol that fails the moment a model is forgetful, rushed, or distracted. The session split the protocol into two layers:

- .context/ in the project. Markdown, git, agnostic.
- A session skill that automates check-in, optional mid-session checkpoint, and check-out.

For Claude Code, the runtime is wired with hooks at the harness level: the SessionStart hook injects STATE and LOG into the model’s context before the model sees the user’s first message. The Stop hook nags at exit. The harness runs the ritual, not the model. Even a forgetful agent can’t skip the read.
For Codex, Gemini, and other CLIs without a hook system, the protocol
still works: they read AGENTS.md and GEMINI.md at startup (their
respective always-loaded instruction files) and follow the same
check-in / check-out steps. That is less of a guarantee than a hook,
but it is a tier-2 fallback that still gets the job done in 90% of
sessions. For tier 3 (open-source LLMs, no instruction file), a shell
script (context-run.sh) wraps the CLI invocation with explicit check-in
injection and an opt-in check-out prompt at exit.
Concept #3: Enforcement is a tiered design choice, not a single mechanism. The same protocol can be enforced at three different levels: harness-enforced (hooks — near-certain), instruction-loaded (read-on-startup files — good), or wrapper-enforced (shell script around the CLI — variable). Match the tier to the tool. Don’t try to make every tool work the same way.
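The tier choice can be written down as a small decision function. This is only a sketch: the `Tier` enum, `pick_tier`, and the two boolean capability flags are illustrative names, not part of the plugin.

```python
from enum import Enum


class Tier(Enum):
    HARNESS = "harness-enforced"        # hooks run the ritual
    INSTRUCTION = "instruction-loaded"  # always-loaded instruction file
    WRAPPER = "wrapper-enforced"        # shell script around the CLI


def pick_tier(has_hooks: bool, has_instruction_file: bool) -> Tier:
    """Choose the strongest enforcement level a given CLI can support."""
    if has_hooks:
        return Tier.HARNESS
    if has_instruction_file:
        return Tier.INSTRUCTION
    return Tier.WRAPPER
```

Under these assumptions, Claude Code would map to `Tier.HARNESS`, Codex and Gemini to `Tier.INSTRUCTION`, and a bare open-source model to `Tier.WRAPPER`.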
To validate the whole thing, the session walked a concrete cross-CLI pipeline frame by frame — the PRD example. Claude drafts. Codex reviews. Gemini stress-tests. Claude addresses findings. At the end, the LOG read like a complete narrative — four entries from three different models, one coherent story.
The friction points named honestly:

- Adversarial tasks don’t auto-spawn (only reviews do). That became agents.yaml, a routing table that maps focus tags to CLIs and serves both human and script orchestration.
- Codex/Gemini check-out is two steps via the wrapper (exit, then re-invoke with the check-out prompt), which is acceptable given the tier.
- Cross-window chat memory is private, so durable insights must be promoted to DECISIONS or they die with the session. The check-out ritual now prompts for this explicitly.

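The routing idea behind agents.yaml can be illustrated with a plain mapping. The schema of agents.yaml is not shown in this write-up, so the `ROUTES` dictionary below is a hypothetical stand-in for its parsed contents. Only the `#review` tag comes from the text; the other tags and the `route` helper are made up for the example.

```python
import re
from typing import Optional

# Hypothetical parsed form of agents.yaml: focus tag -> CLI name.
ROUTES = {"#draft": "claude", "#review": "codex", "#adversarial": "gemini"}

TAG = re.compile(r"#[\w-]+")


def route(task_line: str, routes: dict, default: Optional[str] = None) -> Optional[str]:
    """Return the CLI for the first focus tag on the task line that has a route."""
    for tag in TAG.findall(task_line):
        if tag in routes:
            return routes[tag]
    return default
```

Because the lookup is just data, a human reading the board and a script walking it would reach the same answer about which CLI takes the task.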
The walk-through also revealed the architecture’s bigger insight:
human orchestration and agent orchestration are the same
architecture. The board is the shared rail. Lou drives
manually, or orchestrate.sh drives automatically — same
reads, same writes, same lock, same routing rules. They can even hand
off mid-pipeline, because the board is the handoff.
The final move was packaging. The original design split things into
two plugins — a project-scoped installer and a user-global session skill
— but a follow-up clarification collapsed that. The session skill only
makes sense in a project that has .context/ set up. So its
scope is project-local too, and the right package is one plugin with two
skills:

- context-mgr: the installer skill, install.sh, and all protocol file templates
- session: check-in, optional mid-session checkpoint, and check-out

The install story that fell out of that consolidation:

    # one-time: get the plugin (do this once per machine)
    /plugin install loudalo/context-mgr
    # per-project: scaffold the protocol files
    /install-context-mgr ← installs into current project
    /install-context-mgr /path/to/project ← installs into a specified path
    /install-context-mgr --check ← dry run first

/install-context-mgr is a user-global slash command
(~/.claude/commands/install-context-mgr.md) that calls the
plugin’s install.sh and passes any flags through. One
command, any project, no plugin scope confusion.
The narrower the install footprint, the cleaner the context. Skills that only activate in projects that opted in are skills that can’t pollute unrelated sessions.
The session’s core move wasn’t technical. It was an inversion of framing.
The default frame for “many agents, one project” is agent-to-agent coordination — message passing, shared memory, real-time sync. That frame is exhausting to implement and exhausting to maintain. It assumes the agents are the load-bearing actors.
The frame that worked is agent-to-store coordination. The agents don’t need to know about each other. They need a shared truth that persists between them. The store is load-bearing; the agents are interchangeable doors.
Once you make that flip:
That generalization is portable beyond context managers. It applies to any system where multiple processes need to coordinate without coupling: build the durable truth first, then watch the coordination problem dissolve into reads and writes against it.
The store is the truth. Open terminals are just doors into it.