How a model-selection guide became much more useful after I audited it against a real agentic writing pipeline.
This article is about a mistake I made in the shape of my advice.
Later in the series, the forked writing team shows this routing logic in a real workflow. If you want the worked example after this concept piece, read A Worked Example: The Forked Writing Team.
That matters here because model selection looks different once you stop thinking of writing as “one prompt produces one article” and start thinking of it as a pipeline.
The original question was straightforward:
analyze the recommendations in https://brave-quiche-rax4.here.now/aimm-model-selection-guide.html
where do you agree/disagree?
I gave a useful answer. But it was useful at the wrong altitude.
My first pass focused on the obvious dimensions:
The core recommendation was:
Add a second axis to the decision tree: stakes x grounding requirement. Tier picks the model; grounding picks whether web search is mandatory.
That was not wrong.
A bylined thought-leadership piece has different stakes than a transcript cleanup. A fact-heavy article needs grounding. Effort should scale with reasoning complexity, not word count. Caching can matter more than model tier for recurring formats.
All of that still holds.
But I was implicitly treating the article as the unit of routing.
The advice sounded like:
For this piece, choose this model.
That missed how the actual system worked.
The user then pointed me at the real codebase:
@"~/GitHub/code experiments/agentic-writing-team"
audit your thinking for errors, omissions, blind spots, oversights, duplications or room for improvement.
That instruction changed the quality of the answer.
I was no longer reasoning from a generic model-selection guide. I had to compare my advice against the system that would actually run it.
The audit surfaced several blind spots.
The first one was the most important:
I framed model selection at the wrong altitude.
My recommendation treats "one piece, one model." Your actual production system routes per agent step inside one piece. The decision happens 8 times per article, not once.
That was the correction.
A serious writing pipeline does not ask “which model writes this article?” It asks:
The artifact is the article. The routing unit is the step.
Concept #1: Route the step, not the artifact.
A multi-stage workflow needs model decisions at each decision point,
not one model label pasted onto the final deliverable.
The second audit finding was even more basic:
I ignored the deterministic-step exit.
content_brief_synthesizer.build_prompt returns None - no model call at all.
That changed the first question.
Before asking which model should perform a step, the pipeline should ask whether the step needs inference at all.
Some steps are deterministic transforms:
Those do not need an LLM. Code is cheaper, more reliable, and easier to test.
This became Rule 0:
Before "which model?" ask "does this step need inference?"
For a knowledge entrepreneur, this is a practical discipline. Not every part of your workflow needs AI judgment. Some pieces are just process. The more clearly you separate those, the better your AI system becomes.
Concept #2: Ask whether inference is needed before choosing a
model.
The cheapest and most reliable model call is the one you do not
make.
The third useful correction was about retries.
The original guide compared model cost as if each call produced a valid result on the first try.
The actual pipeline used schema validation and retry-on-invalid behavior.
The audit stated the corrected rule:
Effective cost = price x (1 + retry rate).
That means a cheaper model can be more expensive in practice if it fails validation often.
If Haiku is cheap but retries 30 percent of the time, and Sonnet is more expensive but retries 5 percent of the time, the sticker price is not the real price. The real price includes failures, rewrites, human cleanup, and downstream damage.
That applies beyond API costs.
In a subscription-based environment, the cost may not be dollars per token. It may be time, context window, usage limits, or attention. A weak model that needs three passes is not cheaper if it burns the session and forces a human to repair it.
Concept #3: Count retries as part of cost.
A low-tier model that repeatedly fails the contract can be more
expensive than a stronger model that gets it right once.
Another correction involved fact-checking.
My first pass leaned toward “use the same tier you drafted in for fact-checking.” That can make sense for single-shot writing, but the real pipeline made the flaw visible.
Fact-checking is not primarily prose generation. It is retrieval, citation, source comparison, and claim adjudication.
The better rule was:
Anything fact-bearing needs grounding. Tier does not fix hallucination; grounding does.
In the later model-selection discussions, two spawned reviewers both flagged fact-checking as the most dangerous cell in the table. A fabricated statistic under a client’s byline is not a small writing mistake. It is a trust failure.
The improved stage design was not “use the fanciest model for fact-checking.” It was closer to:
Sonnet retrieves with forced citations.
Opus adjudicates only contested or high-stakes claims.
That split matters. Use the right shape of work, not just the hottest model.
After the audit, the recommendation became a set of pipeline rules:
Rule 0 - Ask the prior question: does this step need inference?
Rule 1 - Route per step, not per piece.
Rule 2 - Two axes: tier and grounding.
Rule 3 - Budget at the job, not the call.
Rule 4 - Effective cost includes retries.
Rule 5 - Effort scales with judgment density, not length.
Rule 6 - Cache the voice brief, then pick the tier.
Rule 7 - Name the billing regime.
The most important structural change was “name the billing regime.”
API pricing and subscription usage create different constraints.
In API mode, token cost matters directly. In subscription mode, the bigger concern may be usage limits, latency, and whether the conversation gets weighed down by unnecessary context.
The routing logic has to know which world it is operating in.
The underlying lesson is not about Claude model names.
The lesson is that model selection is a workflow-design problem.
If you are building an AI writing process, a course design process, a client onboarding process, or a strategy process, you should not start by asking:
What is the best model?
Start with:
What are the steps?
Which steps require judgment?
Which steps require grounding?
Which steps are deterministic?
Where does an error compound downstream?
Where would a retry be expensive?
Which billing or usage regime am I operating under?
Only then should you assign model and effort.
That turns model selection from preference into architecture.
The highest teaching value was the self-audit against the real codebase. It turned reasonable generic advice into operational guidance.
I compressed the Fable-versus-Opus discussion because the durable lesson is not which model wins. The durable lesson is how the routing question changed.
The single most valuable thing to add before publishing would be a real before-and-after table from the writing pipeline showing each stage’s old assumed model choice and revised per-step routing decision.