A working guide for knowledge entrepreneurs choosing between Haiku 4.5, Sonnet 4.6, Opus 4.7, and Opus 4.8 — built from what Anthropic actually published, not what the marketing says.
The right Claude for the job is almost never the most expensive one. It’s almost never the newest one either.
It’s the cheapest model that crosses the quality bar for your task. Everything else is wallet leakage.
Here’s the one-page version. The rest of this guide is the case for why.
| Work tier | Coding pick | Creative / advisory pick | Effort | Why |
|---|---|---|---|---|
| Bulk, high-volume, transform-y work | Haiku 4.5 | Sonnet 4.6 (skip Haiku) | high | Haiku handles code transformation at one-fifth the cost of Opus. It collapses on voice, judgment, and tone — wrong tool for writing or advisory at any price. |
| Everyday, mid-stakes work | Sonnet 4.6 | Sonnet 4.6 | high | The “good enough” default for both lenses. Newsletter drafts, internal memos, refactors, agentic tasks that aren’t load-bearing. |
| Flagship, your-name-on-it work | Opus 4.8 | Opus 4.8 | xhigh | 4.7 → 4.8 is a bigger lift for creative work (+96 Elo on GDPval-AA) than for coding (+1 point on SWE-bench Verified). Worth the upgrade either way: same price as 4.7. |
| Hard reasoning, agentic loops, complex multi-step | Opus 4.8 | Opus 4.8 | xhigh / max | When the task asks the model to think, not just transform. On GPQA Diamond, Sonnet trails Opus by 20 points. Pay for that gap when it pays you back. |
| When in doubt, what do I default to? | Sonnet 4.6 | Sonnet 4.6 | high | Start cheap, escalate to Opus on failure. The model you reach for first is usually too expensive. |
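If you route programmatically, the table above can be sketched as a small lookup. This is a minimal illustration, not a recommended implementation: the tier keys, the `pick` helper, and the choice of `xhigh` (rather than `max`) for hard reasoning are assumptions, and the model names are display labels, not API identifiers.

```python
# Display labels only; these are NOT real API model identifiers.
# Tier and lens keys are hypothetical names chosen for this sketch.
ROUTING = {
    "bulk": {"coding": ("Haiku 4.5", "high"), "creative": ("Sonnet 4.6", "high")},
    "everyday": {"coding": ("Sonnet 4.6", "high"), "creative": ("Sonnet 4.6", "high")},
    "flagship": {"coding": ("Opus 4.8", "xhigh"), "creative": ("Opus 4.8", "xhigh")},
    # The table says "xhigh / max" here; xhigh is picked as the starting point.
    "hard_reasoning": {"coding": ("Opus 4.8", "xhigh"), "creative": ("Opus 4.8", "xhigh")},
}

# The "when in doubt" row of the table.
DEFAULT = ("Sonnet 4.6", "high")


def pick(tier: str, lens: str) -> tuple[str, str]:
    """Return (model label, effort) for a work tier and lens, or the default."""
    return ROUTING.get(tier, {}).get(lens, DEFAULT)


if __name__ == "__main__":
    print(pick("bulk", "coding"))       # ('Haiku 4.5', 'high')
    print(pick("bulk", "creative"))     # ('Sonnet 4.6', 'high')
    print(pick("unclassified", "any"))  # ('Sonnet 4.6', 'high')
```

Anything the lookup does not recognise falls back to the default row, which matches the "start cheap, escalate on failure" advice.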
That’s the field manual. Now the case.
Anthropic publishes one score per model. At max effort. That’s it.
There is no public table showing Opus 4.7 at low vs Opus 4.6 at xhigh. Anyone who tells you “Sonnet at medium beats Opus at low” is guessing, and that includes me. The only per-effort number Anthropic has released is this: on their internal agentic coding eval, Opus 4.7 goes from ~71% at xhigh to ~74.5% at max. That’s a lift of about 3.5 points for 2× the thinking tokens.
That’s the entire public dataset on the effort-vs-performance question.
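To make that single data point concrete, here is the arithmetic as a short sketch. The function name and output fields are made up for illustration; the inputs are the approximate figures quoted above (~71%, ~74.5%, 2× tokens).

```python
def effort_step(score_low: float, score_high: float, token_multiplier: float) -> dict:
    """Summarise what one effort step buys, and what it costs in tokens."""
    lift = score_high - score_low
    return {
        "lift_points": round(lift, 2),
        "relative_lift_pct": round(100 * lift / score_low, 1),
        "extra_tokens_pct": round(100 * (token_multiplier - 1), 1),
    }


if __name__ == "__main__":
    # Opus 4.7, xhigh -> max, using the approximate published figures.
    print(effort_step(71.0, 74.5, 2.0))
    # {'lift_points': 3.5, 'relative_lift_pct': 4.9, 'extra_tokens_pct': 100.0}
```

Roughly a 5% relative improvement for 100% more thinking tokens. Whether that trade is worth it depends entirely on what a failed task costs you.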
So this guide tells you what we do know — model-to-model — and where the answer lives on your own keyboard.
A few things to notice before we move on.
SWE-bench Verified (coding): Haiku → Sonnet is a 6-point lift. Sonnet → Opus 4.7 is an 8-point lift. Opus 4.7 → Opus 4.8 is a 1-point lift at the same price.
GPQA Diamond (PhD-level reasoning): The Sonnet → Opus gap blows out from 8 points (on coding) to 20 points (74.1 vs 94.2). This is the single most important pattern in the chart — Sonnet looks fine on coding, then falls off a cliff on reasoning. If your task asks the model to think rather than transform, that 20-point gap is what you’re paying Opus for.
Dashes mean Anthropic didn’t publish that combination. Haiku has no published GPQA score. Sonnet has no published SWE-Pro. The gaps are the data.
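The gaps quoted above can be recomputed from the scores in this guide. The sketch below is only a convenience, with two assumptions flagged: the text does not say which Opus version the 94.2 GPQA figure belongs to (it is filed under Opus 4.7 here), and `None` stands in for a dash, meaning "not published".

```python
from typing import Optional

# Scores as quoted in this guide. None = no published figure (a dash in the chart).
SCORES: dict[str, dict[str, Optional[float]]] = {
    "SWE-bench Verified": {"Haiku 4.5": 73.3, "Sonnet 4.6": 79.6, "Opus 4.7": 87.6},
    # Assumption: 94.2 is attributed to Opus 4.7; the guide only says "Opus".
    "GPQA Diamond": {"Haiku 4.5": None, "Sonnet 4.6": 74.1, "Opus 4.7": 94.2},
}


def gap(benchmark: str, lower: str, higher: str) -> Optional[float]:
    """Point gap between two models, or None if either score is unpublished."""
    low = SCORES[benchmark].get(lower)
    high = SCORES[benchmark].get(higher)
    if low is None or high is None:
        return None
    return round(high - low, 1)


if __name__ == "__main__":
    print(gap("SWE-bench Verified", "Haiku 4.5", "Sonnet 4.6"))  # 6.3
    print(gap("SWE-bench Verified", "Sonnet 4.6", "Opus 4.7"))   # 8.0
    print(gap("GPQA Diamond", "Sonnet 4.6", "Opus 4.7"))         # 20.1
    print(gap("GPQA Diamond", "Haiku 4.5", "Sonnet 4.6"))        # None
```

Returning `None` instead of guessing keeps the missing combinations visible rather than papering over them.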
What the chart is saying:
The interesting overlaps:
| Cost zone | What overlaps | The honest read |
|---|---|---|
| ~3¢ | Haiku 4.5 max vs Opus 4.7 low | The most attractive test. Haiku at max is a known 73.3%. Opus 4.7 at low is unmeasured, but the distance between the two models’ ceilings (73.3 vs 87.6) suggests it lands higher. If you’re routing for cost, this is the boundary to probe. |
| ~5¢ | Sonnet 4.6 max vs Opus 4.7 medium | Sonnet’s ceiling is 79.6. Opus’s ceiling is 87.6. The xhigh → max delta on Opus is only ~3 points, so Opus at medium is almost certainly still above Sonnet at max. Sonnet rarely wins this overlap for coding. |
| ~7¢ | Opus 4.6 xhigh vs Opus 4.7 low | Same price tier ($25/M out), so xhigh on 4.6 lands near low on 4.7 on the cost axis. Opus 4.7 max is +6.8 points on Opus 4.6 max — bigger than a full effort step. The generational lift probably wins. Don’t run 4.6 anymore. |
Rule of thumb for coding: the generational gap between Opus versions is larger than the gap between effort levels within one version. So a newer Opus at lower effort beats an older Opus at higher effort, most of the time, at lower cost.
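Why do a small model at max effort and a big model at low effort land in the same cost zone? A rough sketch of the arithmetic, under loud assumptions: only output tokens are counted (input cost is ignored), Opus is priced at the $25/M output quoted in the table, and Haiku's output price is taken as one-fifth of that, following the "one-fifth the cost of Opus" line earlier in this guide.

```python
def output_tokens_for(cents: float, usd_per_million_out: float) -> int:
    """How many output tokens a per-call budget buys.

    (cents / 100) dollars divided by (price / 1_000_000) dollars per token
    simplifies to cents * 10_000 / price.
    """
    return round(cents * 10_000 / usd_per_million_out)


OPUS_OUT = 25.0           # $/M output tokens, as quoted in the table
HAIKU_OUT = OPUS_OUT / 5  # assumption: "one-fifth the cost" applies to output pricing

if __name__ == "__main__":
    for cents in (3, 5, 7):
        print(cents, output_tokens_for(cents, HAIKU_OUT), output_tokens_for(cents, OPUS_OUT))
    # 3 6000 1200
    # 5 10000 2000
    # 7 14000 2800
```

At the same spend, the cheaper model gets five times the thinking budget. The open question the overlaps table asks is whether that extra budget closes the gap in capability.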
Different benchmark, different story. This is GDPval-AA, Artificial Analysis’s Elo-rated evaluation built on OpenAI’s GDPval tasks: “economically valuable knowledge work” across finance, legal, advisory, writing, and business analysis. The Elo scale: a gap of roughly 100 Elo = noticeably better; 200+ = different league.
Three things change versus the coding chart:
1. Haiku falls off a cliff. 902 Elo. That’s 500+ Elo below Sonnet. On SWE-V the Haiku → Sonnet gap was 6 points. Here it’s the gap between “useful junior” and “barely coherent.” Haiku is a transformation model — extract, classify, summarize, route. It is not a thought partner. For writing, advisory, brand voice — Haiku is the wrong default at any cost.
2. Sonnet gets more competitive. Sonnet → Opus 4.7 is only 125 Elo here, against 8 points on SWE-V. That’s roughly a 65/35 head-to-head, not a blowout. For bulk creative work — newsletter drafts, internal memos, advisory summaries — Sonnet at high is probably close enough for the cost saving.
3. The 4.7 → 4.8 upgrade actually matters. +96 Elo. That’s a meaningfully bigger jump than the +1 SWE-V point you got on coding. If you write or advise for a living, 4.8 is the move. Same price as 4.7, real lift.
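Elo gaps are easier to reason about as head-to-head odds. The sketch below uses the standard Elo expected-score formula with a 400-point scale; that GDPval-AA's ratings follow exactly this convention is an assumption, so treat the outputs as ballpark figures.

```python
def elo_win_probability(gap: float) -> float:
    """Expected head-to-head win rate for the higher-rated side.

    Standard Elo logistic curve with a 400-point scale (assumed here).
    """
    return 1.0 / (1.0 + 10 ** (-gap / 400.0))


if __name__ == "__main__":
    for label, elo_gap in [
        ("Opus 4.7 -> 4.8", 96),
        ("Sonnet -> Opus 4.7", 125),
        ("Haiku -> Sonnet (at least)", 500),
    ]:
        print(f"{label}: {elo_win_probability(elo_gap):.2f}")
    # Opus 4.7 -> 4.8: 0.63
    # Sonnet -> Opus 4.7: 0.67
    # Haiku -> Sonnet (at least): 0.95
```

A 125 Elo gap works out to about 67/33, close to the rough 65/35 quoted above. A 500 Elo gap means the weaker model loses about 19 comparisons out of 20.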
The interesting overlaps:
| Cost zone | What overlaps | The honest read |
|---|---|---|
| ~3¢ | Haiku max vs Sonnet low | Sonnet wins by a mile. The 500-Elo gap doesn’t close with effort levels. Sonnet is the floor for any creative or advisory work. |
| ~5¢ | Sonnet high vs Opus 4.7 low | The most interesting overlap. Sonnet at high is fine for second-tier work — drafts, summaries, anything that’s “good enough.” Opus earns its keep on flagship work, named-byline articles, anything with your face on it. |
| ~7¢ | Sonnet max vs Opus 4.8 medium | If Opus 4.8’s known +96 Elo lift carries down through effort levels, 4.8 at medium probably beats Sonnet at max in the same cost band. Worth a private A/B if you publish regularly. |
This is the part I have to be straight with you about.
Anthropic gave us one per-effort delta on one benchmark for one model. Opus 4.7 xhigh → max = about +3.5 points (~71% to ~74.5%) at 2× the tokens.
That’s it. The rest of the curve is empty. Every chart in this guide shows a dashed line because the public data does not exist to fill it in.
What that means in practice:
The fastest way to answer the effort-vs-performance question for your own work is below.
Stop reading benchmarks. Run one. Here’s the minimum-viable version.
Setup:
Sonnet 4.6 @ high, Opus 4.8 @ medium, Opus 4.8 @ xhigh.
Score:
Decide:
Roughly $30-50 of inference. Half a day of setup. You’d know more about your real model/effort sweet spot than every blog post on the internet combined.
This is the move. The benchmark for your work is your work.
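Once the calls are graded, the decision step is mechanical: pick the cheapest configuration whose average score clears your bar. A minimal sketch, assuming you have logged one row per graded call as (configuration, score, cost in dollars). The function name, the 1-5 rubric, and every number in the sample rows are invented placeholders, not real results.

```python
from collections import defaultdict
from statistics import mean


def cheapest_passing(results, quality_bar):
    """results: iterable of (config, score, cost_usd), one row per graded call.

    Returns (winner, summary) where summary maps config -> (mean score, mean cost)
    and winner is the cheapest config whose mean score meets the bar, or None.
    """
    scores, costs = defaultdict(list), defaultdict(list)
    for config, score, cost in results:
        scores[config].append(score)
        costs[config].append(cost)
    summary = {c: (mean(scores[c]), mean(costs[c])) for c in scores}
    passing = [c for c, (avg_score, _) in summary.items() if avg_score >= quality_bar]
    winner = min(passing, key=lambda c: summary[c][1]) if passing else None
    return winner, summary


if __name__ == "__main__":
    # Placeholder rows for illustration only; replace with your own graded calls.
    rows = [
        ("Sonnet 4.6 @ high", 3, 0.04), ("Sonnet 4.6 @ high", 4, 0.05),
        ("Opus 4.8 @ medium", 4, 0.06), ("Opus 4.8 @ medium", 5, 0.07),
        ("Opus 4.8 @ xhigh", 5, 0.11), ("Opus 4.8 @ xhigh", 5, 0.12),
    ]
    winner, summary = cheapest_passing(rows, quality_bar=4.0)
    print(winner)
```

Set the quality bar before you look at the scores. Choosing it afterwards makes it too easy to rationalise the model you already wanted.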
Three rules. Keep them visible.
1. Default cheap. Escalate on failure. Start with the cheapest model that might work. Move up only when you see it fail. The model you reach for first is usually too expensive. The smart intern can handle 80% of the work and leave you the 20%.
2. The generational gap usually beats the effort gap. A newer model at lower effort usually beats an older model at higher effort, at lower cost. Opus 4.7 → 4.8 is the same price, so there is no reason to stay on 4.7. Same logic when the next Opus ships.
3. The benchmark for your work is your work. Public benchmarks are model-to-model. Your task distribution is yours. If a routing decision matters enough to argue about, it matters enough to run a 90-call private eval before you commit.
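Rule 1 can be sketched as an escalation ladder: try models from cheapest to most expensive and stop at the first output that passes your check. Everything here is hypothetical scaffolding: `run` stands in for whatever client call you use, `passes` for your own quality check, and the model names are labels rather than API identifiers.

```python
from typing import Callable, Optional


def escalate(
    task: str,
    ladder: list[str],
    run: Callable[[str, str], str],
    passes: Callable[[str], bool],
) -> tuple[Optional[str], Optional[str]]:
    """Try each model in cost order; return the first (model, output) that passes.

    Returns (None, None) if every rung fails, so the caller can hand the task
    to a human instead of silently shipping a bad answer.
    """
    for model in ladder:
        output = run(model, task)
        if passes(output):
            return model, output
    return None, None


if __name__ == "__main__":
    ladder = ["Haiku 4.5", "Sonnet 4.6", "Opus 4.8"]  # cheapest first

    # Stand-ins for a real model call and a real quality check.
    def fake_run(model: str, task: str) -> str:
        return f"{model}: draft for {task}"

    def fake_passes(output: str) -> bool:
        return output.startswith("Sonnet")

    print(escalate("weekly newsletter", ladder, fake_run, fake_passes))
    # ('Sonnet 4.6', 'Sonnet 4.6: draft for weekly newsletter')
```

The ladder only pays off if the quality check is cheap and trustworthy. A check that waves everything through just makes the cheapest model your permanent default.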
A few things I’m keeping an eye on, in case you are too.
That’s the guide. Three models, two charts, three rules, one private eval. Print it, share it, use it.
If you’re routing a lot of traffic through these models and want a hand setting up your own eval, that’s exactly the kind of thing we work through together in the group.
Let’s serve people, do good, have fun and make money — abundantly. Namaste.
Sources: Introducing Claude Opus 4.7 (Anthropic) · Claude Opus 4.8 benchmarks explained (Vellum) · Opus 4.7 xhigh mode explained (apiyi) · GDPval-AA Leaderboard (llm-stats) · Claude Benchmarks 2026 (morphllm)