Which Claude, When

A working guide for knowledge entrepreneurs choosing between Haiku 4.5, Sonnet 4.6, Opus 4.7, and Opus 4.8 — built from what Anthropic actually published, not what the marketing says.

The landing, up front

The right Claude for the job is almost never the most expensive one. It’s almost never the newest one either.

It’s the cheapest model that crosses the quality bar for your task. Everything else is wallet leakage.

Here’s the one-page version. The rest of this guide is the case for why.

| Work tier | Coding pick | Creative / advisory pick | Effort | Why |
|---|---|---|---|---|
| Bulk, high-volume, transform-y work | Haiku 4.5 | Sonnet 4.6 (skip Haiku) | high | Haiku handles code transformation at one-fifth the cost of Opus. It collapses on voice, judgment, and tone: wrong tool for writing or advisory at any price. |
| Everyday, mid-stakes work | Sonnet 4.6 | Sonnet 4.6 | high | The “good enough” default for both lenses. Newsletter drafts, internal memos, refactors, agentic tasks that aren’t load-bearing. |
| Flagship, your-name-on-it work | Opus 4.8 | Opus 4.8 | xhigh | 4.7 → 4.8 is a bigger lift for creative work (+96 Elo) than for coding (+1 SWE-V). Worth the upgrade either way, since it is the same price as 4.7. |
| Hard reasoning, agentic loops, complex multi-step | Opus 4.8 | Opus 4.8 | xhigh / max | When the task asks the model to think, not just transform. Reasoning is the 20-point gap between Sonnet and Opus. Pay for it when it pays you back. |
| When in doubt, what do I default to? | Sonnet 4.6 | Sonnet 4.6 | high | Start cheap, escalate to Opus on failure. The model you reach for first is usually too expensive. |
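If you route requests in code, the table above can be written down as a small lookup. This is a minimal sketch: the tier keys (`bulk`, `everyday`, `flagship`, `reasoning`) and the `pick` helper are names made up for illustration, and the model strings are display names, not API model IDs.

```python
# Hypothetical encoding of the one-page table.
# Model strings are display labels, not real API model identifiers.
ROUTING = {
    "bulk":      {"coding": ("Haiku 4.5", "high"),  "creative": ("Sonnet 4.6", "high")},
    "everyday":  {"coding": ("Sonnet 4.6", "high"), "creative": ("Sonnet 4.6", "high")},
    "flagship":  {"coding": ("Opus 4.8", "xhigh"),  "creative": ("Opus 4.8", "xhigh")},
    # The table lists "xhigh / max" for this tier; xhigh is used as the starting point here.
    "reasoning": {"coding": ("Opus 4.8", "xhigh"),  "creative": ("Opus 4.8", "xhigh")},
}

# The "when in doubt" row of the table.
DEFAULT = ("Sonnet 4.6", "high")


def pick(tier: str, lens: str) -> tuple[str, str]:
    """Return (model, effort) for a work tier and lens, falling back to the default row."""
    return ROUTING.get(tier, {}).get(lens, DEFAULT)


print(pick("bulk", "coding"))
print(pick("something-new", "creative"))
```

Anything the lookup does not recognize falls through to Sonnet 4.6 at high, which matches the last row of the table.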

That’s the field manual. Now the case.

The honest caveat (read this once, then we move on)

Anthropic publishes one score per model. At max effort. That’s it.

There is no public table showing Opus 4.7 at low vs Opus 4.6 at xhigh. Anyone who tells you “Sonnet at medium beats Opus at low” is guessing, including me. The only per-effort number Anthropic has released is this: on their internal agentic coding eval, Opus 4.7 goes from ~71% at xhigh to ~74.5% at max. That is a lift of about 3.5 points for 2× the thinking tokens.

That’s the entire public dataset on the effort-vs-performance question.

So this guide tells you what we do know — model-to-model — and where the answer lives on your own keyboard.

The four models, side by side

[Chart: Multi-benchmark comparison across Claude 4.x models]

A few things to notice before we move on.

SWE-bench Verified (coding): Haiku → Sonnet is a 6-point lift. Sonnet → Opus 4.7 is an 8-point lift. Opus 4.7 → Opus 4.8 is a 1-point lift at the same price.

GPQA Diamond (PhD-level reasoning): The Sonnet → Opus gap blows out from 8 points (on coding) to 20 points (74.1 vs 94.2). This is the single most important pattern in the chart — Sonnet looks fine on coding, then falls off a cliff on reasoning. If your task asks the model to think rather than transform, that 20-point gap is what you’re paying Opus for.

Dashes mean Anthropic didn’t publish that combination. Haiku has no published GPQA score. Sonnet has no published SWE-Pro. The gaps are the data.

Coding work — the sweet spot view

[Chart: Cost overlap for coding work]

What the chart is saying:

The interesting overlaps:

| Cost zone | What overlaps | The honest read |
|---|---|---|
| ~3¢ | Haiku 4.5 max vs Opus 4.7 low | The most attractive test. Haiku at max is 73.3%, which is known. Opus 4.7 at low is unmeasured, but the generational jump suggests it lands higher. If you’re routing for cost, this is the boundary to probe. |
| ~5¢ | Sonnet 4.6 max vs Opus 4.7 medium | Sonnet’s ceiling is 79.6. Opus’s ceiling is 87.6. The xhigh → max delta on Opus is only ~3.5 points, so Opus at medium is probably still above Sonnet at max. Sonnet rarely wins this overlap for coding. |
| ~7¢ | Opus 4.6 xhigh vs Opus 4.7 low | Same price tier ($25/M out), so xhigh on 4.6 lands near low on 4.7 on the cost axis. Opus 4.7 max is +6.8 points on Opus 4.6 max, which is bigger than a full effort step. The generational lift probably wins. Don’t run 4.6 anymore. |

Rule of thumb for coding: the generational gap between Opus versions is larger than the gap between effort levels within one version. So a newer Opus at lower effort beats an older Opus at higher effort, most of the time, at lower cost.
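The arithmetic behind that rule of thumb is short enough to check by hand. The sketch below uses only figures quoted in this guide. Note the assumption baked in: the effort delta comes from Anthropic's internal agentic coding eval while the generational delta comes from the coding benchmark scores, so the comparison is indicative, not apples to apples.

```python
# Figures quoted in this guide. The two deltas come from different evals,
# so treat the ratio as a rough indication only.
effort_xhigh = 71.0        # Opus 4.7 at xhigh, internal agentic coding eval (approximate)
effort_max = 74.5          # Opus 4.7 at max, same eval (approximate)
token_multiplier = 2.0     # max uses roughly 2x the thinking tokens of xhigh
generational_delta = 6.8   # Opus 4.7 max minus Opus 4.6 max

effort_delta = effort_max - effort_xhigh
ratio = generational_delta / effort_delta

print(f"one effort step (xhigh to max): +{effort_delta:.1f} points for {token_multiplier:.0f}x tokens")
print(f"one generation (4.6 to 4.7):    +{generational_delta:.1f} points at the same price tier")
print(f"generational lift is about {ratio:.1f}x the one known effort step")
```

Only one effort step has been published, so the ratio says nothing about the low or medium end of the curve.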

Creative, advisory, writing, sales — the sweet spot view

[Chart: Cost overlap for creative and advisory work]

Different benchmark, different story. This is GDPval-AA, the Elo-scored leaderboard that Artificial Analysis runs on OpenAI’s GDPval tasks, which cover “economically valuable knowledge work” across finance, legal, advisory, writing, and business analysis. It is not an Anthropic eval. Reading the Elo scale: a gap of roughly 100 Elo means noticeably better; 200+ means a different league.

Three things change versus the coding chart:

1. Haiku falls off a cliff. 902 Elo. That’s 500+ Elo below Sonnet. On SWE-V the Haiku → Sonnet gap was 6 points. Here it’s the gap between “useful junior” and “barely coherent.” Haiku is a transformation model — extract, classify, summarize, route. It is not a thought partner. For writing, advisory, brand voice — Haiku is the wrong default at any cost.

2. Sonnet gets more competitive. Sonnet → Opus 4.7 is only 125 Elo here, against 8 points on SWE-V. That’s roughly a 65/35 head-to-head, not a blowout. For bulk creative work — newsletter drafts, internal memos, advisory summaries — Sonnet at high is probably close enough for the cost saving.

3. The 4.7 → 4.8 upgrade actually matters. +96 Elo. That’s a meaningfully bigger jump than the +1 SWE-V point you got on coding. If you write or advise for a living, 4.8 is the move. Same price as 4.7, real lift.
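To turn the Elo gaps above into head-to-head odds, the standard Elo logistic curve is enough. This is a sketch under one assumption: that GDPval-AA ratings follow the conventional 400-point Elo scale. The leaderboard may compute ratings differently, so read the percentages as approximate.

```python
def elo_expected_score(gap: float, scale: float = 400.0) -> float:
    """Expected head-to-head score for the higher-rated side.

    Assumes the conventional Elo logistic curve with a 400-point scale.
    """
    return 1.0 / (1.0 + 10.0 ** (-gap / scale))


# Gaps quoted in this guide; the Haiku gap is "500+", so 500 is a lower bound.
gaps = [
    ("Opus 4.8 over Opus 4.7", 96),
    ("Opus 4.7 over Sonnet 4.6", 125),
    ("Sonnet 4.6 over Haiku 4.5", 500),
]

for label, gap in gaps:
    print(f"{label}: {elo_expected_score(gap):.0%} expected preference")
```

Under that assumption a 125 Elo gap works out to roughly two wins in three, close to the 65/35 figure above, while a 500 Elo gap is closer to nineteen in twenty.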

The interesting overlaps:

| Cost zone | What overlaps | The honest read |
|---|---|---|
| ~3¢ | Haiku max vs Sonnet low | Sonnet wins by a mile. The 500-Elo gap doesn’t close with effort levels. Sonnet is the floor for any creative or advisory work. |
| ~5¢ | Sonnet high vs Opus 4.7 low | The most interesting overlap. Sonnet at high is fine for second-tier work: drafts, summaries, anything that’s “good enough.” Opus earns its keep on flagship work, named-byline articles, anything with your face on it. |
| ~7¢ | Sonnet max vs Opus 4.8 medium | If Opus 4.8’s known +96 Elo lift carries down through effort levels, 4.8 at medium probably beats Sonnet at max in the same cost band. Worth a private A/B if you publish regularly. |

The effort question (the part nobody can answer)

This is the part I have to be straight with you about.

Anthropic gave us one per-effort delta on one benchmark for one model. Opus 4.7 xhigh → max = about +3.5 points at 2× the tokens.

That’s it. The rest of the curve is empty. Every chart in this guide shows a dashed line because the public data does not exist to fill it in.

What that means in practice:

The fastest way to answer it for your own work is below.

The 20-minute private eval (the part that actually answers your question)

Stop reading benchmarks. Run one. Here’s the minimum-viable version.

Setup:

  1. Pick two tasks from your actual workload. Not toy tasks — real ones. e.g., “rewrite this article lead in my voice” and “summarize this client call into three insights.”
  2. Grab 15 real inputs per task. Last month’s emails, last week’s drafts. Whatever you actually do.
  3. Pick three (model, effort) cells worth testing. For creative work I’d start with: Sonnet 4.6 @ high, Opus 4.8 @ medium, Opus 4.8 @ xhigh.
  4. Run each input through each cell. 15 × 2 × 3 = 90 calls.
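The setup steps above amount to a full grid of task × input × cell. A minimal sketch of that grid follows; the task names are hypothetical placeholders and the model strings are display names, so swap in your own before wiring this to any real API call.

```python
from itertools import product

# Hypothetical task names; replace with two tasks from your real workload.
tasks = ["rewrite_lead", "summarize_call"]
inputs_per_task = 15

# The three (model, effort) cells suggested above for creative work.
cells = [("Sonnet 4.6", "high"), ("Opus 4.8", "medium"), ("Opus 4.8", "xhigh")]

# One run per (task, input index, cell) combination.
runs = [
    {"task": task, "input_id": i, "model": model, "effort": effort}
    for task, i, (model, effort) in product(tasks, range(inputs_per_task), cells)
]

print(len(runs))  # 15 inputs x 2 tasks x 3 cells = 90
```

Building the list up front makes it easy to log each call and to resume if a run is interrupted.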

Score:

  1. Either grade them yourself with a 1-5 rubric (voice fidelity, insight depth, polish, total) — or hand the outputs to Sonnet 4.6 as a judge with the same rubric.
  2. Average the scores per cell. Tally cost per cell.
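Averaging and tallying per cell takes only a few lines once each graded call is a record. The sketch below assumes a record shape (`cell`, `score`, `cost_usd`) invented for illustration, and the sample records are placeholders, not measured results.

```python
from collections import defaultdict
from statistics import mean

# Placeholder records for illustration only; these are NOT measured results.
graded = [
    {"cell": "Sonnet 4.6 @ high", "score": 3, "cost_usd": 0.04},
    {"cell": "Sonnet 4.6 @ high", "score": 4, "cost_usd": 0.06},
    {"cell": "Opus 4.8 @ medium", "score": 4, "cost_usd": 0.07},
    {"cell": "Opus 4.8 @ medium", "score": 5, "cost_usd": 0.09},
]


def summarize(records):
    """Group graded calls by cell, then average the scores and total the cost."""
    by_cell = defaultdict(list)
    for rec in records:
        by_cell[rec["cell"]].append(rec)
    return {
        cell: {
            "avg_score": mean(r["score"] for r in rows),
            "total_cost_usd": round(sum(r["cost_usd"] for r in rows), 4),
            "calls": len(rows),
        }
        for cell, rows in by_cell.items()
    }


summary = summarize(graded)
for cell, stats in summary.items():
    print(cell, stats)
```

If you grade on several rubric dimensions, store one score per dimension and average each separately before combining them.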

Decide:

  1. Pick the cheapest cell that hits your quality bar.
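The decision step is a filter followed by a minimum. In this sketch the per-cell summary and the quality bar of 4.0 are placeholder values chosen for illustration, and `cheapest_passing` is a hypothetical helper name.

```python
# Placeholder per-cell summary; the numbers are illustrative, not measured.
summary = {
    "Sonnet 4.6 @ high": {"avg_score": 3.6, "total_cost_usd": 1.50},
    "Opus 4.8 @ medium": {"avg_score": 4.2, "total_cost_usd": 2.10},
    "Opus 4.8 @ xhigh":  {"avg_score": 4.4, "total_cost_usd": 3.90},
}


def cheapest_passing(summary, quality_bar):
    """Return the lowest-cost cell whose average score meets the bar, or None if none do."""
    passing = {cell: s for cell, s in summary.items() if s["avg_score"] >= quality_bar}
    if not passing:
        return None
    return min(passing, key=lambda cell: passing[cell]["total_cost_usd"])


print(cheapest_passing(summary, quality_bar=4.0))
```

A `None` result means no tested cell cleared the bar, which is a signal to test a stronger cell or revisit the rubric.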

Roughly $30-50 of inference. Half a day of setup. You’d know more about your real model/effort sweet spot than every blog post on the internet combined.

This is the move. The benchmark for your work is your work.

The decision rules — printable version

Three rules. Keep them visible.

1. Default cheap. Escalate on failure. Start with the cheapest model that might work. Move up only when you see it fail. The model you reach for first is usually too expensive. The smart intern can handle 80% of the work and leave you the 20%.

2. The generational gap usually beats the effort gap. A newer model at lower effort tends to beat an older model at higher effort, and at lower cost. Opus 4.7 → 4.8 is the same price, so there is no reason to stay on 4.7. The same logic applies whenever the next Opus ships.

3. The benchmark for your work is your work. Public benchmarks are model-to-model. Your task distribution is yours. If a routing decision matters enough to argue about, it matters enough to run a 90-call private eval before you commit.
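Rule 1 can be expressed as a short escalation ladder. This is a sketch, not a client library: `call_model` and `passes` are hypothetical callables you would supply (your own API wrapper and your own quality check), and the ladder uses display names, not API model IDs. For creative or advisory work, the guide's advice would mean starting the ladder at Sonnet.

```python
# Cheapest first. Display names only; map them to real model IDs in your own wrapper.
LADDER = ["Haiku 4.5", "Sonnet 4.6", "Opus 4.8"]


def run_with_escalation(prompt, call_model, passes, ladder=LADDER):
    """Try each model in order and stop at the first output that passes the check.

    Returns (model, output, passed). If nothing passes, returns the last attempt.
    """
    output = None
    for model in ladder:
        output = call_model(model, prompt)
        if passes(output):
            return model, output, True
    return ladder[-1], output, False


# Stand-ins so the sketch runs without any API access.
def fake_call(model, prompt):
    return f"[{model}] draft for: {prompt}"


def fake_check(output):
    # Pretend only Sonnet-or-better output clears the bar.
    return output.startswith("[Sonnet") or output.startswith("[Opus")


model, output, passed = run_with_escalation("rewrite this lead", fake_call, fake_check)
print(model, passed)
```

The quality check is the hard part in practice. A rubric score from the private eval is one candidate for it.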

What I’m watching next

A few things I’m keeping an eye on, in case you are too.

That’s the guide. Four models, three charts, three rules, one private eval. Print it, share it, use it.

If you’re routing a lot of traffic through these models and want a hand setting up your own eval, that’s exactly the kind of thing we work through together in the group.

Let’s serve people, do good, have fun and make money — abundantly. Namaste.

Sources: Introducing Claude Opus 4.7 (Anthropic) · Claude Opus 4.8 benchmarks explained (Vellum) · Opus 4.7 xhigh mode explained (apiyi) · GDPval-AA Leaderboard (llm-stats) · Claude Benchmarks 2026 (morphllm)