The same coordination protocol gets enforced three different ways across Claude, Codex, and open-source models. Pretending they’re equal is the real bug — here’s how I learned to match the mechanism to the tool instead.
I had a working protocol for multiple AI agents sharing one project:
a .context/ store every agent reads on entry and writes on
exit, with a check-in / check-out ritual that keeps the handoff clean.
In Claude Code it was wired with hooks, so the ritual fired
automatically. Then I wanted the same thing for Codex and Gemini, and I
asked what I assumed was a routine question:
“so if they dont use hooks, how is the context management enforced with codex, gemini, or other (open source?) llms?”
I expected a config-file equivalent — “here’s where Codex keeps its version of the hook.” What I got instead was a flat admission that changed how I thought about the whole system:
“Honestly? It isn’t — not the same way. That’s the real gap.”
That sentence is the article. The protocol doesn’t get enforced one way across tools. It gets enforced at three different strengths, and the work isn’t making them uniform — it’s being honest about which is which and matching the mechanism to the tool.
My mental model going in was binary — either the protocol is enforced or it isn’t. I’d seen the Claude Code hooks work, watched the SessionStart hook inject the board into context before I’d even typed anything, and concluded “good, it’s enforced.” So when I moved to other CLIs, I went looking for the same switch.
There is no same switch. And the reason matters more than the fact.
The Claude Code hook works because it runs at the harness level — the layer around the model, before the model gets a turn. The hook fires whether or not the model is paying attention, whether or not it’s the kind of model that remembers instructions, whether or not it’s rushing to finish. The model doesn’t enforce the rule. The harness does, and the model just inherits a context that already has the rule’s effects baked in.
That’s a fundamentally different thing from “the model has been told to follow the rule.” And once I saw that distinction, the tiers fell out on their own.
Concept #1: Ask where the rule actually runs — in the model, or in the harness around it. Only enforcement that happens outside the model is real enforcement. If compliance depends on the model choosing to comply, you don’t have a rule — you have a strong suggestion. Know which one you’re shipping.
Here’s how the breakdown landed, strongest to weakest:
Tier 1 — Harness-enforced (Claude Code only). The hook runs before the model sees anything. A forgetful or distracted model can’t skip it, because skipping isn’t a decision the model gets to make. This is the only tier that’s true enforcement.
Tier 2 — Always-loaded instruction (Codex, Gemini).
AGENTS.md and GEMINI.md are read automatically
at startup by their respective CLIs — the same way Claude reads
CLAUDE.md. The model sees the protocol every
session. But seeing isn’t running. A model that drifts mid-session, gets
distracted, or rushes the ending can still skip check-out. Compliance is
probabilistic, not guaranteed.
Tier 3 — Open-source / custom (Ollama, LM Studio, and friends). No hook system, no always-loaded config file. You’re relying entirely on whatever system prompt you inject by hand, and compliance tracks model quality — a smaller model will quietly miss steps.
The honest summary came as a reliability table, and I want to reproduce it because the honesty is the point:
| Scenario | Reliability |
|---|---|
| Claude Code + hooks | Near-certain — harness enforces it |
| Codex / Gemini + AGENTS/GEMINI.md | Good — seen every session, but can drift |
| Open-source + system prompt only | Variable — depends on model and prompt discipline |
| Any CLI + wrapper script | Good — shell enforces entry, prompts check-out |
The temptation, when you’ve built something, is to claim it works
everywhere. The more useful move was the opposite: name exactly where it
degrades, so I’d know which gaps to plan around instead of discovering
them when LOG.md had holes in it.
Concept #2: Make degradation honest and visible, not hidden. A capability matrix that admits “this is variable here” is worth more than one that claims uniform support. You can design around a gap you can see. You can’t design around one you papered over.
The instinct after seeing the tiers was to try to drag Tier 2 and Tier 3 up to Tier 1 — to find some clever way to make Codex and Ollama enforce the ritual as hard as a Claude hook. That instinct was wrong, and resisting it was the actual design decision.
You can’t give a tool a harness it doesn’t have. What you
can do is wrap the tool from outside. That’s what
context-run.sh is — a thin shell script that does the
harness’s job in shell instead of in the CLI:
#!/bin/bash
# context-run.sh — wraps any CLI with check-in/check-out enforcement
# CHECK-IN: inject STATE + recent LOG before the model starts
CONTEXT=$(cat .context/STATE.md && echo "---" && head -80 .context/LOG.md)
export CONTEXT_INJECT="$CONTEXT"
# Run the CLI (pass through all args)
"$1" "${@:2}"
# CHECK-OUT: after CLI exits, offer to run the check-out ritual
echo "Session ended. Run check-out? (y/n)"
read -r answer
if [ "$answer" = "y" ]; then
"$1" "Read .context/STATE.md and .context/LOG.md, then perform the check-out ritual from CHARTER.md."
fiIt’s blunt, but it works — it’s the same pattern as Claude’s hooks, just implemented one layer out. The shell injects state on entry (so even a model that ignores its instruction file still receives the board) and prompts the ritual on exit. It lifts Tier 2 and Tier 3 toward Tier 1 without pretending the underlying tool changed.
Concept #3: When a tool lacks the rail you need, build the rail around it. You can’t add a harness to a CLI that doesn’t have one — but you can wrap the CLI in a shell that does the harness’s job. Enforce from the layer the tool can’t skip: the process that launches it.
I had the wrapper, it worked, and the obvious next step was to route everything through it — one uniform entry point, enforcement everywhere. I caught myself before I did, and the instruction I gave was deliberate:
“yes, write it, but don’t make it obligatory. if i use claude code, i still want it to use the hooks approach. for other llm’s i’ll run the shell”
Two reasons this mattered. First, Claude Code already had the
better enforcement — its native hooks. Forcing it through a
shell wrapper would have been a downgrade dressed as consistency,
trading a harness-level guarantee for a shell-level one just to make the
tools look alike. Second, the wrapper’s prompts are deliberately
escapable — the check-out is [y/N], the clipboard copy is
[y/N], nothing is forced. A safety rail you can’t decline
becomes friction you route around; a safety rail you opt into stays
useful.
So the final shape is asymmetric on purpose: Claude Code keeps hooks, everything else runs the shell by choice, and nothing is mandatory. The consistency I gave up bought reliability I kept.
Concept #4: Make the safety rail opt-in, not obligatory. Forcing every tool through one mechanism downgrades the tools that already had something better, and a mandatory rail invites workarounds. Let each tool use its strongest available enforcement, and make the fallback a choice.
Strip away the context-manager specifics and the pattern is about enforcing any rule across tools you don’t fully control — which is most of the interesting work now that nobody’s stack is single-vendor.
The reflex is to want one mechanism that works everywhere. That reflex produces either a lowest-common-denominator rule (weak enough that every tool can run it, useless because it’s so weak) or a brittle abstraction that fakes uniformity until the day it doesn’t. The better frame is a tiered one: figure out the strongest enforcement each tool natively supports, use that, and add an external wrapper to lift the weak ones — without dragging the strong ones down to match.
Here’s how you’d adapt it well outside of AI tooling. Say you’re enforcing a code-review policy across a team that uses GitHub, a vendor’s hosted GitLab, and one legacy SVN repo. The losing move is a policy doc that says “everyone reviews before merge” and hoping. The tiered move: branch protection rules on GitHub (harness-level — the platform blocks the merge), a required-approval setting on GitLab (good — the platform nudges but config can drift), and for SVN, a pre-commit hook script you install on the server (the wrapper — you built the rail the platform lacks). Same policy, three enforcement strengths, honestly mapped. You stop pretending the SVN repo is as locked-down as GitHub, and you plan accordingly.
The domain changes. The move doesn’t: rank your tools by native enforcement strength, use each tool’s best, wrap the weak ones, and never lie to yourself about which tier a given tool is in.
If you need to enforce a rule across tools with different capabilities:
[y/N], nothing forced. A rail people can decline is a rail
they’ll actually keep using.— Lou, after asking a routine question and getting an honest answer that reshaped the design.
This arc was the one chunk of the session with no overlap with the other two teaching blocks — the architecture piece is about what the store is, the method piece is about how it got built, and this one is about a single sharp realization: enforcement is a spectrum, not a switch. I built it around the verbatim “Honestly? It isn’t” answer because the honesty is the whole lesson, and a paraphrase would have softened exactly the thing worth keeping.
I led with my wrong assumption (enforcement as a binary feature) on purpose — the tier model only lands if you’ve felt the binary one fail first. The hardest editorial call was the mandatory-vs-opt-in section: it’s a small decision in the transcript (one sentence from me) but it carries the article’s most counterintuitive lesson, so I gave it its own beat rather than burying it in the wrapper section.
The single most valuable thing I could add before publishing: a real
before/after of a LOG.md with a Tier-2 gap in it — an
actual missing check-out entry from a drifted session — so the reader
sees the cost of the gap instead of taking “it degrades” on faith.