I Asked How the Rule Was Enforced. The Honest Answer Was "It Isn't."

The same coordination protocol gets enforced three different ways across Claude, Codex, and open-source models. Pretending they’re equal is the real bug — here’s how I learned to match the mechanism to the tool instead.

I had a working protocol for multiple AI agents sharing one project: a .context/ store every agent reads on entry and writes on exit, with a check-in / check-out ritual that keeps the handoff clean. In Claude Code it was wired with hooks, so the ritual fired automatically. Then I wanted the same thing for Codex and Gemini, and I asked what I assumed was a routine question:

“so if they dont use hooks, how is the context management enforced with codex, gemini, or other (open source?) llms?”

I expected a config-file equivalent — “here’s where Codex keeps its version of the hook.” What I got instead was a flat admission that changed how I thought about the whole system:

“Honestly? It isn’t — not the same way. That’s the real gap.”

That sentence is the article. The protocol doesn’t get enforced one way across tools. It gets enforced at three different strengths, and the work isn’t making them uniform — it’s being honest about which is which and matching the mechanism to the tool.

One protocol, three ways to make it stick: harness hook is near-certain, instruction file is good, system prompt only is variable; a shell wrapper lifts the weaker tiers

What I assumed: enforcement is a feature you turn on

My mental model going in was binary — either the protocol is enforced or it isn’t. I’d seen the Claude Code hooks work, watched the SessionStart hook inject the board into context before I’d even typed anything, and concluded “good, it’s enforced.” So when I moved to other CLIs, I went looking for the same switch.

There is no same switch. And the reason matters more than the fact.

The Claude Code hook works because it runs at the harness level — the layer around the model, before the model gets a turn. The hook fires whether or not the model is paying attention, whether or not it’s the kind of model that remembers instructions, whether or not it’s rushing to finish. The model doesn’t enforce the rule. The harness does, and the model just inherits a context that already has the rule’s effects baked in.

That’s a fundamentally different thing from “the model has been told to follow the rule.” And once I saw that distinction, the tiers fell out on their own.

Concept #1: Ask where the rule actually runs — in the model, or in the harness around it. Only enforcement that happens outside the model is real enforcement. If compliance depends on the model choosing to comply, you don’t have a rule — you have a strong suggestion. Know which one you’re shipping.

The three tiers, ranked by how much I can trust them

Here’s how the breakdown landed, strongest to weakest:

Tier 1 — Harness-enforced (Claude Code only). The hook runs before the model sees anything. A forgetful or distracted model can’t skip it, because skipping isn’t a decision the model gets to make. This is the only tier that’s true enforcement.

Tier 2 — Always-loaded instruction (Codex, Gemini). AGENTS.md and GEMINI.md are read automatically at startup by their respective CLIs — the same way Claude reads CLAUDE.md. The model sees the protocol every session. But seeing isn’t running. A model that drifts mid-session, gets distracted, or rushes the ending can still skip check-out. Compliance is probabilistic, not guaranteed.

Tier 3 — Open-source / custom (Ollama, LM Studio, and friends). No hook system, no always-loaded config file. You’re relying entirely on whatever system prompt you inject by hand, and compliance tracks model quality — a smaller model will quietly miss steps.

The honest summary came as a reliability table, and I want to reproduce it because the honesty is the point:

Scenario	Reliability
Claude Code + hooks	Near-certain — harness enforces it
Codex / Gemini + AGENTS/GEMINI.md	Good — seen every session, but can drift
Open-source + system prompt only	Variable — depends on model and prompt discipline
Any CLI + wrapper script	Good — shell enforces entry, prompts check-out

The temptation, when you’ve built something, is to claim it works everywhere. The more useful move was the opposite: name exactly where it degrades, so I’d know which gaps to plan around instead of discovering them when LOG.md had holes in it.

Concept #2: Make degradation honest and visible, not hidden. A capability matrix that admits “this is variable here” is worth more than one that claims uniform support. You can design around a gap you can see. You can’t design around one you papered over.

The fix wasn’t to make them equal

The instinct after seeing the tiers was to try to drag Tier 2 and Tier 3 up to Tier 1 — to find some clever way to make Codex and Ollama enforce the ritual as hard as a Claude hook. That instinct was wrong, and resisting it was the actual design decision.

You can’t give a tool a harness it doesn’t have. What you can do is wrap the tool from outside. That’s what context-run.sh is — a thin shell script that does the harness’s job in shell instead of in the CLI:

#!/bin/bash
# context-run.sh — wraps any CLI with check-in/check-out enforcement

# CHECK-IN: inject STATE + recent LOG before the model starts
CONTEXT=$(cat .context/STATE.md && echo "---" && head -80 .context/LOG.md)
export CONTEXT_INJECT="$CONTEXT"

# Run the CLI (pass through all args)
"$1" "${@:2}"

# CHECK-OUT: after CLI exits, offer to run the check-out ritual
echo "Session ended. Run check-out? (y/n)"
read -r answer
if [ "$answer" = "y" ]; then
  "$1" "Read .context/STATE.md and .context/LOG.md, then perform the check-out ritual from CHARTER.md."
fi

It’s blunt, but it works — it’s the same pattern as Claude’s hooks, just implemented one layer out. The shell injects state on entry (so even a model that ignores its instruction file still receives the board) and prompts the ritual on exit. It lifts Tier 2 and Tier 3 toward Tier 1 without pretending the underlying tool changed.

Concept #3: When a tool lacks the rail you need, build the rail around it. You can’t add a harness to a CLI that doesn’t have one — but you can wrap the CLI in a shell that does the harness’s job. Enforce from the layer the tool can’t skip: the process that launches it.

The decision I almost got wrong: making it mandatory

I had the wrapper, it worked, and the obvious next step was to route everything through it — one uniform entry point, enforcement everywhere. I caught myself before I did, and the instruction I gave was deliberate:

“yes, write it, but don’t make it obligatory. if i use claude code, i still want it to use the hooks approach. for other llm’s i’ll run the shell”

Two reasons this mattered. First, Claude Code already had the better enforcement — its native hooks. Forcing it through a shell wrapper would have been a downgrade dressed as consistency, trading a harness-level guarantee for a shell-level one just to make the tools look alike. Second, the wrapper’s prompts are deliberately escapable — the check-out is [y/N], the clipboard copy is [y/N], nothing is forced. A safety rail you can’t decline becomes friction you route around; a safety rail you opt into stays useful.

So the final shape is asymmetric on purpose: Claude Code keeps hooks, everything else runs the shell by choice, and nothing is mandatory. The consistency I gave up bought reliability I kept.

Concept #4: Make the safety rail opt-in, not obligatory. Forcing every tool through one mechanism downgrades the tools that already had something better, and a mandatory rail invites workarounds. Let each tool use its strongest available enforcement, and make the fallback a choice.

What this unlocks

Strip away the context-manager specifics and the pattern is about enforcing any rule across tools you don’t fully control — which is most of the interesting work now that nobody’s stack is single-vendor.

The reflex is to want one mechanism that works everywhere. That reflex produces either a lowest-common-denominator rule (weak enough that every tool can run it, useless because it’s so weak) or a brittle abstraction that fakes uniformity until the day it doesn’t. The better frame is a tiered one: figure out the strongest enforcement each tool natively supports, use that, and add an external wrapper to lift the weak ones — without dragging the strong ones down to match.

Here’s how you’d adapt it well outside of AI tooling. Say you’re enforcing a code-review policy across a team that uses GitHub, a vendor’s hosted GitLab, and one legacy SVN repo. The losing move is a policy doc that says “everyone reviews before merge” and hoping. The tiered move: branch protection rules on GitHub (harness-level — the platform blocks the merge), a required-approval setting on GitLab (good — the platform nudges but config can drift), and for SVN, a pre-commit hook script you install on the server (the wrapper — you built the rail the platform lacks). Same policy, three enforcement strengths, honestly mapped. You stop pretending the SVN repo is as locked-down as GitHub, and you plan accordingly.

The domain changes. The move doesn’t: rank your tools by native enforcement strength, use each tool’s best, wrap the weak ones, and never lie to yourself about which tier a given tool is in.

Key takeaways

Only enforcement outside the model counts. A hook that runs in the harness is a rule; an instruction the model reads is a suggestion. Know which you’re shipping.
There are three tiers, not two. Harness-enforced (near-certain), instruction-loaded (good, drifts), system-prompt-only (variable). Most multi-tool setups span all three whether you’ve named them or not.
Honesty about degradation is a feature. A matrix that says “variable here” lets you design around the gap. A matrix that claims uniform support hides the gap until it bites.
Build the rail the tool lacks — don’t fake the rail it has. A shell wrapper can do the harness’s job from one layer out, lifting weak tiers without touching strong ones.
Opt-in beats obligatory. Forcing every tool through one mechanism downgrades the best ones and invites workarounds.

How to start

If you need to enforce a rule across tools with different capabilities:

For each tool, ask one question: “Does enforcement happen in the harness, or does it depend on the model/user choosing to comply?” Sort your tools into tiers by the answer.
Use each tool’s strongest native mechanism first. Don’t reach for a uniform wrapper before you’ve used the platform-level enforcement the tool already gives you (branch protection, startup config files, hooks).
Write the wrapper only for the tiers that need it. A thin launcher script that injects state on entry and prompts the ritual on exit will lift your weak tiers — leave the strong ones alone.
Make the wrapper escapable. Prompts should be [y/N], nothing forced. A rail people can decline is a rail they’ll actually keep using.
Write the reliability matrix down. One small table mapping tool → enforcement tier → reliability. It’s the document you’ll thank yourself for when something drifts.

— Lou, after asking a routine question and getting an honest answer that reshaped the design.

Behind the Article

This arc was the one chunk of the session with no overlap with the other two teaching blocks — the architecture piece is about what the store is, the method piece is about how it got built, and this one is about a single sharp realization: enforcement is a spectrum, not a switch. I built it around the verbatim “Honestly? It isn’t” answer because the honesty is the whole lesson, and a paraphrase would have softened exactly the thing worth keeping.

I led with my wrong assumption (enforcement as a binary feature) on purpose — the tier model only lands if you’ve felt the binary one fail first. The hardest editorial call was the mandatory-vs-opt-in section: it’s a small decision in the transcript (one sentence from me) but it carries the article’s most counterintuitive lesson, so I gave it its own beat rather than burying it in the wrapper section.

The single most valuable thing I could add before publishing: a real before/after of a LOG.md with a Tier-2 gap in it — an actual missing check-out entry from a drifted session — so the reader sees the cost of the gap instead of taking “it degrades” on faith.