I Asked Which Model to Use. The Codebase Answered: For Which Step?

How a model-selection guide became much more useful after I audited it against a real agentic writing pipeline.

This article is about a mistake I made in the shape of my advice.

Later in the series, the forked writing team shows this routing logic in a real workflow. If you want the worked example after this concept piece, read A Worked Example: The Forked Writing Team.

The workflow matters here because model selection looks different once you stop thinking of writing as “one prompt produces one article” and start thinking of it as a pipeline.

The original question was straightforward:

analyze the recommendations in https://brave-quiche-rax4.here.now/aimm-model-selection-guide.html
where do you agree/disagree?

I gave a useful answer. But it was useful at the wrong altitude.

The First Recommendation

My first pass focused on the obvious dimensions:

stakes
grounding
effort
caching
fact-checking risk

The core recommendation was:

Add a second axis to the decision tree: stakes x grounding requirement. Tier picks the model; grounding picks whether web search is mandatory.

That was not wrong.

A bylined thought-leadership piece has different stakes than a transcript cleanup. A fact-heavy article needs grounding. Effort should scale with reasoning complexity, not word count. Caching can matter more than model tier for recurring formats.

All of that still holds.

But I was implicitly treating the article as the unit of routing.

The advice sounded like:

For this piece, choose this model.

That missed how the actual system worked.

The Audit

The user then pointed me at the real codebase:

@"~/GitHub/code experiments/agentic-writing-team"
audit your thinking for errors, omissions, blind spots, oversights, duplications or room for improvement.

That instruction changed the quality of the answer.

I was no longer reasoning from a generic model-selection guide. I had to compare my advice against the system that would actually run it.

The audit surfaced several blind spots.

The first one was the most important:

I framed model selection at the wrong altitude.
My recommendation treats "one piece, one model." Your actual production system routes per agent step inside one piece. The decision happens 8 times per article, not once.

That was the correction.

A serious writing pipeline does not ask “which model writes this article?” It asks:

Which model should evaluate the audience?
Which model should select the angle?
Which model should draft?
Which model should review?
Which model should fact-check?
Which steps need no model at all?

The artifact is the article. The routing unit is the step.

Concept #1: Route the step, not the artifact.
A multi-stage workflow needs model decisions at each decision point, not one model label pasted onto the final deliverable.
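A minimal sketch of what routing per step can look like. The step names (apart from content_brief_synthesizer) and the tier labels "mid" and "large" are placeholders I made up for illustration. They are not the real pipeline's step list, and the assignments are not recommendations.

```python
# Hypothetical step names and tier labels, for illustration only.
# None marks a step that makes no model call.
STEP_MODELS = {
    "audience_evaluation": "mid",
    "angle_selection": "large",
    "content_brief_synthesizer": None,
    "draft": "mid",
    "review": "large",
    "fact_check": "mid",
}


def model_for(step):
    """Return the tier recorded for one step; fail loudly on an unrouted step."""
    if step not in STEP_MODELS:
        raise KeyError(f"no routing decision recorded for step {step!r}")
    return STEP_MODELS[step]


def plan_article(steps):
    """Make one routing decision per step, not one per article."""
    return [(step, model_for(step)) for step in steps]


plan = plan_article(["audience_evaluation", "draft", "fact_check"])
```

The point of the lookup failing loudly is that a new step cannot silently inherit "the article's model." Someone has to make the decision for that step.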

The Deterministic Exit

The second audit finding was even more basic:

I ignored the deterministic-step exit.
content_brief_synthesizer.build_prompt returns None - no model call at all.

That changed the first question.

Before asking which model should perform a step, the pipeline should ask whether the step needs inference at all.

Some steps are deterministic transforms:

assemble a schema
merge artifact paths
format JSON
copy a validated field
combine existing outputs into a known template

Those do not need an LLM. Code is cheaper, more reliable, and easier to test.

This became Rule 0:

Before "which model?" ask "does this step need inference?"

For a knowledge entrepreneur, this is a practical discipline. Not every part of your workflow needs AI judgment. Some pieces are just process. The more clearly you separate those, the better your AI system becomes.

Concept #2: Ask whether inference is needed before choosing a model.
The cheapest and most reliable model call is the one you do not make.
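Here is one way the deterministic exit could be wired up, assuming a step signals "no inference needed" by returning None from build_prompt, as the audit observed for content_brief_synthesizer. The class layout, the run_deterministic method, and the call_model callable are my assumptions, not the codebase's actual structure.

```python
from typing import Callable, Optional


class Step:
    """Hypothetical base class: a step either builds a prompt or returns None."""

    def build_prompt(self, inputs: dict) -> Optional[str]:
        raise NotImplementedError

    def run_deterministic(self, inputs: dict) -> dict:
        raise NotImplementedError


class ContentBriefSynthesizer(Step):
    def build_prompt(self, inputs: dict) -> Optional[str]:
        return None  # Rule 0: this step needs no inference

    def run_deterministic(self, inputs: dict) -> dict:
        # Plain code: copy already-validated fields into a known template.
        return {"audience": inputs["audience"], "angle": inputs["angle"]}


class Drafter(Step):
    def build_prompt(self, inputs: dict) -> Optional[str]:
        return f"Write a draft for {inputs['audience']} about {inputs['angle']}."


def run_step(step: Step, inputs: dict, call_model: Callable[[str], str]) -> dict:
    """Ask 'does this step need inference?' before any model is involved."""
    prompt = step.build_prompt(inputs)
    if prompt is None:
        return step.run_deterministic(inputs)
    return {"text": call_model(prompt)}
```

Because the model call is injected as a plain callable, a test can pass a fake and confirm that the deterministic step never touches it.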

Retry Cost Changed the Math

The third useful correction was about retries.

The original guide compared model cost as if each call produced a valid result on the first try.

The actual pipeline used schema validation and retry-on-invalid behavior.

The audit stated the corrected rule:

Effective cost = price x (1 + retry rate).

That means a cheaper model can be more expensive in practice if it fails validation often.

If Haiku is cheap but retries 30 percent of the time, and Sonnet is more expensive but retries 5 percent of the time, the sticker price is not the real price. The real price includes failures, rewrites, human cleanup, and downstream damage.

That applies beyond API costs.

In a subscription-based environment, the cost may not be dollars per token. It may be time, context window, usage limits, or attention. A weak model that needs three passes is not cheaper if it burns the session and forces a human to repair it.

Concept #3: Count retries as part of cost.
A low-tier model that repeatedly fails the contract can be more expensive than a stronger model that gets it right once.
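The arithmetic is easy to sketch. The prices below are invented units chosen to make the comparison visible, not real list prices. The audit's formula, price x (1 + retry rate), assumes at most one retry per call. If retries repeat until the output validates and failures are independent, the expected cost is price / (1 - retry rate), which is slightly higher.

```python
def effective_cost(price: float, retry_rate: float) -> float:
    """The audit's rule: price x (1 + retry rate), i.e. at most one retry."""
    if not 0.0 <= retry_rate <= 1.0:
        raise ValueError("retry_rate must be between 0 and 1")
    return price * (1.0 + retry_rate)


def expected_cost_until_valid(price: float, retry_rate: float) -> float:
    """Retry until valid, assuming independent failures: price / (1 - rate)."""
    if not 0.0 <= retry_rate < 1.0:
        raise ValueError("retry_rate must be in [0, 1)")
    return price / (1.0 - retry_rate)


# Hypothetical per-call prices in arbitrary units.
cheap = effective_cost(price=1.00, retry_rate=0.30)   # 1.00 * 1.30
strong = effective_cost(price=1.20, retry_rate=0.05)  # 1.20 * 1.05
cheap_is_costlier = cheap > strong
```

With these retry rates, the retry term alone flips the ordering only when the stronger model costs less than about 1.24 times the cheaper one (1.30 / 1.05). Past that ratio, the human cleanup and downstream damage have to carry the argument, and those are not in the formula.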

Grounding Was Not the Same as Intelligence

Another correction involved fact-checking.

My first pass leaned toward “use the same tier you drafted in for fact-checking.” That can make sense for single-shot writing, but the real pipeline made the flaw visible.


Fact-checking is not primarily prose generation. It is retrieval, citation, source comparison, and claim adjudication.

The better rule was:

Anything fact-bearing needs grounding. Tier does not fix hallucination; grounding does.

In the later model-selection discussions, two spawned reviewers both flagged fact-checking as the most dangerous cell in the table. A fabricated statistic under a client’s byline is not a small writing mistake. It is a trust failure.

The improved stage design was not “use the fanciest model for fact-checking.” It was closer to:

Sonnet retrieves with forced citations.
Opus adjudicates only contested or high-stakes claims.

That split matters. Match the model to the shape of the work instead of reaching for the hottest model.
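The retrieve-then-adjudicate split can be sketched as a triage over claims. This assumes the retrieval pass has already attached citations where it found them. The Claim fields and the three buckets are hypothetical names, and sending uncited claims back to retrieval is my reading of "forced citations," not a documented behavior of the pipeline.

```python
from dataclasses import dataclass, field


@dataclass
class Claim:
    text: str
    citations: list = field(default_factory=list)  # sources found by retrieval
    contested: bool = False    # sources disagree
    high_stakes: bool = False  # e.g. a statistic under a client's byline


def triage(claims):
    """Sort claims into: back to retrieval, escalate, or accept as cited."""
    result = {"needs_citation": [], "adjudicate": [], "accepted": []}
    for claim in claims:
        if not claim.citations:
            # Forced citations: an uncited claim is never accepted.
            result["needs_citation"].append(claim)
        elif claim.contested or claim.high_stakes:
            # Only this bucket goes to the stronger adjudicating model.
            result["adjudicate"].append(claim)
        else:
            result["accepted"].append(claim)
    return result
```

The stronger model only sees the adjudicate bucket, so its cost scales with the number of risky claims rather than with the length of the article.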

The Revised Rules

After the audit, the recommendation became a set of pipeline rules:

Rule 0 - Ask the prior question: does this step need inference?
Rule 1 - Route per step, not per piece.
Rule 2 - Two axes: tier and grounding.
Rule 3 - Budget at the job, not the call.
Rule 4 - Effective cost includes retries.
Rule 5 - Effort scales with judgment density, not length.
Rule 6 - Cache the voice brief, then pick the tier.
Rule 7 - Name the billing regime.

The most important structural change was “name the billing regime.”

API pricing and subscription usage create different constraints.

In API mode, token cost matters directly. In subscription mode, the bigger concern may be usage limits, latency, and whether the conversation gets weighed down by unnecessary context.

The routing logic has to know which world it is operating in.
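One way to make the regime explicit is to pass it into the budget check, so the same job is judged against different limits. All names and limits here are illustrative assumptions. A real subscription plan may meter usage differently.

```python
from dataclasses import dataclass
from enum import Enum


class BillingRegime(Enum):
    API = "api"
    SUBSCRIPTION = "subscription"


@dataclass(frozen=True)
class JobUsage:
    dollars: float       # metered spend for the whole job (API mode)
    messages: int        # calls counted against a plan limit (subscription mode)
    context_tokens: int  # context carried between steps


def binding_constraints(regime, usage, *, dollar_budget, message_limit, context_limit):
    """Return the limits this job exceeds under the named billing regime."""
    exceeded = []
    if regime is BillingRegime.API:
        if usage.dollars > dollar_budget:
            exceeded.append("dollar_budget")
    else:
        if usage.messages > message_limit:
            exceeded.append("message_limit")
        if usage.context_tokens > context_limit:
            exceeded.append("context_limit")
    return exceeded
```

The check runs on usage for the whole job rather than a single call, which is also where Rule 3 says the budget belongs.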

What This Unlocks

The underlying lesson is not about Claude model names.

The lesson is that model selection is a workflow-design problem.

If you are building an AI writing process, a course design process, a client onboarding process, or a strategy process, you should not start by asking:

What is the best model?

Start with:

What are the steps?
Which steps require judgment?
Which steps require grounding?
Which steps are deterministic?
Where does an error compound downstream?
Where would a retry be expensive?
Which billing or usage regime am I operating under?

Only then should you assign model and effort.

That turns model selection from preference into architecture.

Key Takeaways

The first useful correction was altitude: route per step, not per artifact.
Ask whether a step needs inference before choosing a model.
Grounding and intelligence are separate decisions.
Retry rate changes real cost.
Subscription environments shift the concern from API dollars to context, usage limits, and workflow reliability.

How to Start

Write the stages of your workflow as a list.
Mark deterministic steps that should be code or templates, not model calls.
Mark judgment-heavy steps where errors compound.
Mark fact-bearing steps that require retrieval or citations.
Assign model and effort per step, with a one-line rationale.
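The checklist above can be turned into a small lint over a written-down plan. This is a sketch under my own assumptions: the field names and the tier and effort labels are hypothetical, and the checks encode only the rules stated in this article.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class PlannedStep:
    name: str
    deterministic: bool    # should be code or a template, not a model call
    fact_bearing: bool     # output carries factual claims
    model: Optional[str]   # hypothetical tier label; None means no model
    effort: Optional[str]  # hypothetical effort label; None means no model
    grounded: bool         # retrieval or citations required
    rationale: str         # one-line reason for the assignment


def plan_problems(steps: List[PlannedStep]) -> List[str]:
    """Return human-readable problems; an empty list means the plan passes."""
    problems = []
    for s in steps:
        if s.deterministic and s.model is not None:
            problems.append(f"{s.name}: deterministic step should not call a model")
        if not s.deterministic and (s.model is None or s.effort is None):
            problems.append(f"{s.name}: inference step needs a model and an effort level")
        if s.fact_bearing and not s.grounded:
            problems.append(f"{s.name}: fact-bearing step lacks grounding")
        if not s.rationale.strip():
            problems.append(f"{s.name}: missing one-line rationale")
    return problems
```

A plan that returns an empty list is not necessarily a good plan, but one that returns problems has skipped a decision this article argues you should make explicitly.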

Behind the Article

The highest teaching value was the self-audit against the real codebase. It turned reasonable generic advice into operational guidance.

I compressed the Fable-versus-Opus discussion because the durable lesson is not which model wins. The durable lesson is how the routing question changed.

The single most valuable thing to add before publishing would be a real before-and-after table from the writing pipeline showing each stage’s old assumed model choice and revised per-step routing decision.