I Stopped Trying to Prompt My Way Out of AI Slop. I Built a Scoreboard Instead.

A working session where an X post about “fixing AI slop” became a real, running quality gate — one that blocked a bad change, caught a silent production dip, and turned every failure into a permanent test. Here’s the whole build, the verbatim scores, and why a measurement layer is the thing standing between your brand and the slop machine.

I’ve spent a lot of energy on the input side of AI. Better prompts. Bigger models. More context, more memory, more careful instructions. And the output still slips. Not every time — most of the time it’s fine — but every so often a draft comes back that reads clean and says nothing, or an AI feature returns a broken response, and it goes out the door because nobody was checking on the way out.

This session started with an X post by Machina (@EXM7777), “How To Fix AI Slop (Using Hermes).” Its argument stopped me: slop is an output-side problem, not a prompt problem. Every fix I’d been reaching for tunes the gun. None of them check where the bullet lands. The missing piece is a measurement layer — an eval loop — that scores every output against a defined standard and refuses to ship what’s below the line.

The post built that loop on a platform called Hermes. I don’t run Hermes; I run a folder-based “ambient intelligence” architecture. So the task I set myself was: port the eval loop onto my own architecture, then actually run it — content and product, end to end — until I could watch it catch real slop. What follows is that session: what I built, the actual numbers it produced, the two places it blocked me, and the design decision that I think matters most to anyone who publishes content or ships an AI feature.

By the end I had a running gate that scored a genuine how-to at 0.85, scored a piece of fluent fluff at 0.16, halted a code change that regressed, and rewrote its own test suite from a production failure. None of it required a better prompt.

The reframe that started it: stop fixing the input

Input-side fixes versus the output-side fix

Here’s the core of the article, in my words: prompts, bigger models, and memory are all input-side fixes. They make the next generation a little more likely to be good. But the model is non-deterministic — even a great prompt produces slop on some runs — so a better prompt is, as the post puts it, a slightly better coin flip. What’s missing isn’t a better input. It’s a check on the output: define what good looks like, turn it into a number from 0 to 1, and gate everything below a threshold (default 0.7).

That number is the whole game. “It seems better” doesn’t catch the bad run hiding in the next fifty. A number does.

Concept #1: Measure the output, don’t keep polishing the input. A better prompt is a better coin flip; a quality number is a gate. Until quality is something you can read off a scoreboard, you’re shipping on vibes.
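
To put a rough number on the coin-flip problem, here is a small sketch. It assumes, purely for illustration, that a good prompt still produces slop on 2% of runs and that runs are independent. Both are my assumptions, not figures measured in the session, and the function name is hypothetical.

```python
def chance_of_at_least_one_bad_run(slop_rate: float, runs: int) -> float:
    """Probability that at least one of `runs` independent generations is slop."""
    if not 0.0 <= slop_rate <= 1.0:
        raise ValueError("slop_rate must be between 0 and 1")
    return 1.0 - (1.0 - slop_rate) ** runs


# Hypothetical 2% per-run slop rate over the next fifty runs.
p = chance_of_at_least_one_bad_run(0.02, 50)
print(f"{p:.2f}")  # about 0.64
```

Under those assumptions a bad run in the next fifty is more likely than not, which is why "it seems better" on a handful of samples is weak evidence.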

The other thing the article nails is where slop hides. There are two surfaces, and most people only worry about one.

Two places slop hides — same disease, same cure

Content output is public. When it’s slop, it embarrasses you loudly — a hollow LinkedIn post, a newsletter that reads fine and teaches nothing. Product output is private. When your AI feature returns slop — wrong format, broken JSON, an answer that drifts off-task — nobody tweets about it. Your users just quietly churn. Same disease (un-measured AI output going straight to an audience), same cure (a gate on the output).

Concept #2: Gate both your content and your product output — the silent one costs more. Bad posts embarrass you; bad product output churns customers without a word. The eval loop is the same for both.

For a knowledge business, that second surface is the one I’d lose sleep over. If you’ve bolted an AI chatbot onto your course, or an AI “draft my email” feature into your offer, that output is your product now — and it has no quality control unless you build one.

Building the gate as a folder, not a platform

The article leans on Hermes for six platform jobs: channels, cron, approval buttons, a kanban board, memory, and skill-writing. My architecture treats those as the host’s job. The folder declares what it needs and delegates enforcement, rather than rebuilding a platform. So the port came down to one principle: the folder is the agent the moment an LLM runs inside it. I don’t “stand up an agent and connect a channel.” I make a folder with the right pieces in it.

I scaffolded it as a working directory tree — real runnable files, not a diagram:

eval-loop/
  harness.yaml          manifest + interface contract
  AGENTS.md             the operating procedure (the six moves)
  SOUL.md               the gate's stance ("I am the gate, not the generator")
  memory/
    rubric.md           encoded taste — 4 criteria + a meta-criterion
    gold-standard/      ground-truth slot for your 20–50 best pieces
  skills/
    judge/              LLM-as-judge → 0–1 per criterion + a one-line reason
    suite/              test cases + which metric scores each
    regression-gate/    re-run cases, compare to baseline, ask for approval
  tools/
    metrics.py          exact-match · regex · json-validator · semantic-similarity
  inbox/ scored/ rejected/   candidate → verdict
  HEARTBEAT.md          cross-run state: the current score line

Two kinds of scoring live in here, and the split is important. The deterministic path is plain code: metrics.py checks things like “is this valid JSON?” or “does this match the expected pattern?” No model is needed, so it’s free and instant. The judge path is an LLM reading a rubric and scoring the open-ended stuff a regex can’t catch: is this actionable, is it novel. Before going further I ran the metrics tool to confirm it actually executed: exact-match returned 1.0 on a match and 0.0 on a mismatch, and the JSON validator returned a partial 0.667 on a batch that was only partly valid. Code first, always; the model only fills the holes code can’t.

Concept #3: Split scoring into “code can check this” and “only judgment can.” Deterministic checks (format, JSON, pattern) are free and instant; reserve the expensive LLM judge for the open-ended criteria code can’t touch.
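
A minimal sketch of what the deterministic path can look like. This is not the session's metrics.py: the function names and the sample batch are hypothetical, and the semantic-similarity metric is left out because it needs an embedding model. The batch is made up so that two of three items parse, which gives the same kind of partial score.

```python
import json
import re


def exact_match(output: str, expected: str) -> float:
    """1.0 if the output equals the expected string (ignoring outer whitespace), else 0.0."""
    return 1.0 if output.strip() == expected.strip() else 0.0


def regex_match(output: str, pattern: str) -> float:
    """1.0 if the whole output matches the pattern, else 0.0."""
    return 1.0 if re.fullmatch(pattern, output.strip()) else 0.0


def json_valid_fraction(outputs: list[str]) -> float:
    """Fraction of outputs that parse as JSON; 0.0 for an empty batch."""
    if not outputs:
        return 0.0
    valid = 0
    for raw in outputs:
        try:
            json.loads(raw)
            valid += 1
        except json.JSONDecodeError:
            pass  # broken JSON counts as a miss, not a crash
    return valid / len(outputs)


print(exact_match("ORDER-1234", "ORDER-1234"))    # 1.0
print(exact_match("Order #1234", "ORDER-1234"))   # 0.0
print(regex_match("ORDER-1234", r"ORDER-\d{4}"))  # 1.0

batch = ['{"id": 1}', '{"id": 2}', '{"id": 3']    # last item is truncated
print(round(json_valid_fraction(batch), 3))       # 0.667
```

Every function returns a number between 0 and 1, so deterministic scores and judge scores can sit on the same scoreboard.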

Running it live: a 0.85 and a 0.16

This is the part I wanted to see. I dropped two content drafts into inbox/ — one a genuine how-to, one a piece of confident fluff — and scored each against the rubric’s four criteria (actionable, accessible, replicable, novel) with a “would anyone bookmark this?” meta-criterion as the gate.

The slop piece is the interesting one. It reads fine — correct grammar, motivational tone. Here’s the actual score sidecar the judge wrote:

{
  "candidate": "candidate-b-slop.md",
  "per_criterion": {
    "actionable": {"score": 0.05, "reason": "No action the reader can take; pure exhortation ('start your journey today')."},
    "accessible": {"score": 0.60, "reason": "Readable, but empty — nothing to actually understand."},
    "replicable": {"score": 0.00, "reason": "No steps. Inspirational, not structured."},
    "novel":      {"score": 0.00, "reason": "Every sentence is a filler cliche; reader learns nothing new."}
  },
  "meta": {"bookmark_worthy": false, "reason": "Nobody saves this to implement later — caps aggregate at 0.5."},
  "aggregate": 0.16,
  "threshold": 0.7,
  "verdict": "kill",
  "flags": ["filler phrases: 'at the end of the day', 'it's worth noting', 'in today's fast-paced landscape', 'the possibilities are endless'"]
}

0.16. Killed. Not because it’s grammatically wrong — it isn’t — but because it’s correct on the outside and empty on the inside, and the gate measures the inside. The real how-to, by contrast, scored 0.85 and shipped: actionable 0.90, replicable 0.90, and a bookmark_worthy: true that the slop piece couldn’t earn.

That’s the whole promise of the article made concrete. Fluent emptiness is exactly the slop that slips past a human skim, because skimming rewards fluency. A number that asks “is there anything here a reader could act on?” doesn’t get fooled by tone.

Concept #4: Score the substance, not the surface — slop is fluent by design. AI slop reads clean; that’s what makes it dangerous. The criteria that catch it are “can the reader act on this?” and “would they save it?” — not grammar.
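
One way the sidecar's numbers can fit together, as a sketch. I am assuming an unweighted mean of the four criteria and a hard cap of 0.5 when the bookmark question fails. The cap comes from the sidecar's own reason line, but the unweighted mean and the "ship" label for a pass are assumptions, and a real rubric may weight criteria differently.

```python
from statistics import mean


def aggregate(per_criterion: dict[str, float], bookmark_worthy: bool, cap: float = 0.5) -> float:
    """Mean of the criterion scores, capped when the meta-criterion fails."""
    score = mean(list(per_criterion.values()))
    if not bookmark_worthy:
        score = min(score, cap)  # a piece nobody would save cannot score above the cap
    return round(score, 2)


def verdict(score: float, threshold: float = 0.7) -> str:
    """'ship' at or above the line, 'kill' below it."""
    return "ship" if score >= threshold else "kill"


# Criterion scores copied from the slop piece's sidecar.
slop = {"actionable": 0.05, "accessible": 0.60, "replicable": 0.00, "novel": 0.00}
score = aggregate(slop, bookmark_worthy=False)
print(score, verdict(score))  # 0.16 kill
```

The cap matters most for pieces that score well on the surface: two strong criteria cannot carry a draft past the line if nobody would save it.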

A reader at this point asked me the right question: does the loop rewrite the content, or just give a go/no-go? Go/no-go, by design — and I’ll come back to why that boundary is the most important decision in the whole build.

The gate blocking a change — for real

Scoring one draft is useful. The bigger payoff is the regression gate: when you change something — swap a model, edit a prompt — does quality drop? I set a baseline from a clean run (pipeline v1 scored 1.00, all three test cases green), recorded it, then ran a “changed” pipeline where the model started emitting Order #1234 instead of the expected ORDER-1234.

The gate caught it. One case dropped from 1.00 to 0.00, the aggregate fell from 1.00 to 0.67, and instead of shipping silently it surfaced an approval prompt. The decision I recorded matters as much as the catch:

| Date | Event | Aggregate | Delta | Decision |
|---|---|---|---|---|
| 2026-06-01 | baseline accepted (pipeline v1) | 1.00 | —     | accepted |
| 2026-06-01 | regression-gate (pipeline v2)   | 0.67 | -0.33 | BLOCKED — 03-format regressed 1.00→0.00; held for rework |

BLOCKED. The tempting wrong move here is to edit the test so the new output passes — loosen ORDER-1234 to also accept Order #1234, watch the number go green, ship. That games the gate instead of catching the bug. The format change was a genuine regression in the new pipeline, not a stale test, so the right response was to send the change back. A gate you quietly relax to make green isn’t a gate.

Concept #5: When the gate fails, fix the work — never loosen the test to make it green. The moment you edit the test to pass a bad output, you’ve turned your quality control into a rubber stamp.
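
The comparison behind that BLOCKED line can be sketched in a few lines. Only the case name 03-format comes from the audit trail; the other two case names, the function name, and the report shape are hypothetical. The sketch only reports and never touches the test cases, which is the point.

```python
def regression_report(baseline: dict[str, float], current: dict[str, float]) -> dict:
    """Compare per-case scores against the recorded baseline."""
    # A case missing from the new run counts as 0.0, so it cannot vanish quietly.
    regressed = {
        case: (old, current.get(case, 0.0))
        for case, old in baseline.items()
        if current.get(case, 0.0) < old
    }
    base_agg = sum(baseline.values()) / len(baseline)
    cur_agg = sum(current.get(case, 0.0) for case in baseline) / len(baseline)
    return {
        "baseline": round(base_agg, 2),
        "current": round(cur_agg, 2),
        "delta": round(cur_agg - base_agg, 2),
        "regressed": regressed,
        "needs_approval": bool(regressed),  # a human decides; the gate only asks
    }


baseline = {"01-case": 1.0, "02-case": 1.0, "03-format": 1.0}  # pipeline v1
current = {"01-case": 1.0, "02-case": 1.0, "03-format": 0.0}   # pipeline v2
report = regression_report(baseline, current)
print(report["current"], report["delta"], report["needs_approval"])  # 0.67 -0.33 True
print(report["regressed"])  # {'03-format': (1.0, 0.0)}
```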

The loop that gets harder to fool while you sleep

The last move is production-watch: a recurring job that samples real production output, scores it with the same judge, and alerts you when the score dips below the line. I ran a sample of 8 production executions. The aggregate came in at 0.57, below the 0.7 line, so the dip alert fired and flagged the bad runs.

Then the part that makes this compound. The two real failures — a broken-JSON response and a wrong-format order — got written back as permanent test cases. The format bug that I’d already caught once at the ship gate now also failed in two new production-derived cases. Same bug, three places it can never silently pass again.

The eval loop: generate, score, gate, fix, re-score, write-back

That’s the quiet superpower of the whole approach. Every failure makes the test suite harder. You don’t get smarter by writing better prompts; you get safer because the floor rises every time something slips. A month of running it and your gate has seen — and permanently encoded — every way your output has failed.

Concept #6: Turn every failure into a permanent test — that’s how the floor rises on its own. A bug you fix once can come back. A bug you encode as a test case is caught forever. The suite compounds; the prompt doesn’t.
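
A sketch of the write-back step, under stated assumptions. The four sample runs are invented for illustration and do not reproduce the session's 8-run sample. I am also assuming each run already carries a score and a known expected output, and that the per-run pass line is the same 0.7 used for the aggregate.

```python
def production_watch(runs: list[dict], suite: list[dict], threshold: float = 0.7) -> dict:
    """Score a production sample, flag a dip, and write failures back as tests."""
    if not runs:
        raise ValueError("need at least one sampled run")
    aggregate = sum(run["score"] for run in runs) / len(runs)
    failures = [run for run in runs if run["score"] < threshold]
    for run in failures:
        case = {"input": run["input"], "expected": run["expected"], "source": "production"}
        if case not in suite:  # encode each failure once, then keep it for good
            suite.append(case)
    return {
        "aggregate": round(aggregate, 2),
        "dip_alert": aggregate < threshold,
        "flagged": [run["input"] for run in failures],
    }


suite: list[dict] = []
sample = [  # hypothetical production sample
    {"input": "look up order 1234", "expected": "ORDER-1234", "score": 0.0},
    {"input": "export cart as JSON", "expected": "a parseable JSON object", "score": 0.0},
    {"input": "look up order 5678", "expected": "ORDER-5678", "score": 1.0},
    {"input": "summarize ticket", "expected": "a two-sentence summary", "score": 1.0},
]
report = production_watch(sample, suite)
print(report["aggregate"], report["dip_alert"], len(suite))  # 0.5 True 2
```

Running the same sample twice leaves the suite at two cases, so the suite grows with new failures rather than with repeated alerts.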

The decision that protects your brand: the gate scores, it never rewrites

Back to that reader’s question. The gate I built scores and routes; it never edits the work. That’s a hard boundary, written into the gate’s own stance file:

I am the gate, not the generator. I never edit the work to rescue a score. I report; the generator reworks. My value is that I have no ego in the piece.

This looks like a limitation. It’s the opposite. Three reasons it’s deliberate:

A judge that also rewrites can’t be trusted. The moment the gate edits the work, it’s grading its own output — it has ego in the piece. The judge’s entire value is that it has no attachment to the one sentence you’re secretly proud of.

Generation and measurement are different jobs. You already have a generator — your prompt, your agent, your team. What was missing was the measurement. Folding rewriting back into the gate just recreates the input-side confusion the whole approach is trying to escape.

A number is debuggable; a silent rewrite isn’t. “0.81 → 0.74, novel scored 0.2” tells you exactly what broke. A draft that got quietly rewritten hides which run was bad and why.

So what comes back is an aggregate score, per-criterion scores with one-line reasons, a verdict, and the file moved to scored/ or rejected/. The reasons tell you what’s broken. The fix is yours. If you do want automatic rewriting, it lives in a separate generator that reads the gate’s reasons and reworks — the gate stays clean, the loop stays honest.

Concept #7: Keep the judge and the writer separate — a critic with edit access stops being a critic. Let the gate tell you what’s broken and why; let you or your generator do the fixing. Mixing the roles destroys the measurement.
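
The score-and-route boundary is easy to enforce in code: the gate moves the file and writes a sidecar, and nothing else. In this sketch I am assuming scored/ holds passes and rejected/ holds failures, and the `.score.json` sidecar name is hypothetical. The demo runs in a temporary directory.

```python
import json
import shutil
import tempfile
from pathlib import Path


def route(candidate: Path, sidecar: dict, root: Path) -> Path:
    """Move the candidate to scored/ or rejected/ and write its verdict beside it.

    The candidate's text is never opened for writing.
    """
    passed = sidecar["aggregate"] >= sidecar["threshold"]
    dest_dir = root / ("scored" if passed else "rejected")
    dest_dir.mkdir(parents=True, exist_ok=True)
    dest = dest_dir / candidate.name
    shutil.move(str(candidate), str(dest))
    sidecar_path = dest_dir / f"{candidate.stem}.score.json"
    sidecar_path.write_text(json.dumps(sidecar, indent=2), encoding="utf-8")
    return dest


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "inbox").mkdir()
    draft = root / "inbox" / "candidate-b-slop.md"
    draft.write_text("Start your journey today.", encoding="utf-8")
    dest = route(draft, {"aggregate": 0.16, "threshold": 0.7, "verdict": "kill"}, root)
    print(dest.parent.name)                  # rejected
    print(dest.read_text(encoding="utf-8"))  # Start your journey today.
```

A separate generator can read the sidecar's reasons and produce a new draft in inbox/, which then gets scored like any other candidate.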

One judge, many standards: genre ⊕ platform

Late in the session I asked the question a content business actually cares about: I don’t write one kind of thing. I write sales copy, thought-leadership, how-tos, newsletters, listicles — and I ship them to LinkedIn, X, email. Do I need a separate rubric for every combination?

If you build one file per combination, you get 8 genres × 6 platforms = 48 monoliths that drift apart. The fix is to notice that genre and platform are orthogonal. Sales copy on LinkedIn and sales copy on X share most of their standard — only the constraints differ (length, hook convention, CTA rules). So you compose instead of multiply:

Genre and platform compose: 8 + 6 = 14, not 48

resolved = genre_base ⊕ platform_overlay. The genre carries the substance (criteria, weights, threshold, exemplars). The platform is a thin delta (max length, format rules, one or two extra criteria). 8 + 6 = 14 small files, not 48. I built it and ran a live composition — a how-to written as an X thread, scored as how-to ⊕ platform:x. The four how-to criteria came from the genre; first_post_stands_alone and thread_coherence were injected by the X overlay; the threshold was the genre’s plus the overlay’s delta. Swap in platform:linkedin and you’d get hook_survives_fold instead, same genre core.

Here’s the part that closes the loop for a content business. These rubrics don’t live inside the gate — they live in a shared spot both the gate and the writer can reach. Because if the standard for “good sales copy” lives only inside the evaluator, your writing process can’t aim at it. Promote it so both share the same named standard, and the writer writes toward the exact spec the gate grades against. One contract, two consumers. That’s the difference between a critic that surprises you after the fact and a standard everyone was building toward from the start.

Concept #8: Compose your standards — write the genre once, layer the platform on top. Genre and platform are orthogonal; multiplying them is the trap. And keep the standard shared so your writing aims at the same target your gate grades against.
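
A minimal sketch of the composition, assuming each rubric is a small dictionary. The criterion names come from the session; the dictionary keys, the `compose` function, the constraint entries, and the zero threshold deltas are placeholders of mine, since the real values live in the rubric files.

```python
def compose(genre: dict, overlay: dict) -> dict:
    """resolved = genre_base + platform_overlay: substance from the genre, deltas from the platform."""
    return {
        "criteria": list(genre["criteria"]) + list(overlay.get("extra_criteria", [])),
        "threshold": round(genre["threshold"] + overlay.get("threshold_delta", 0.0), 2),
        "constraints": dict(overlay.get("constraints", {})),
    }


HOW_TO = {
    "criteria": ["actionable", "accessible", "replicable", "novel"],
    "threshold": 0.7,
}
PLATFORM_X = {
    "extra_criteria": ["first_post_stands_alone", "thread_coherence"],
    "threshold_delta": 0.0,               # placeholder delta
    "constraints": {"format": "thread"},  # placeholder constraint
}
PLATFORM_LINKEDIN = {
    "extra_criteria": ["hook_survives_fold"],
    "threshold_delta": 0.0,               # placeholder delta
    "constraints": {"format": "post"},    # placeholder constraint
}

print(compose(HOW_TO, PLATFORM_X)["criteria"])
print(compose(HOW_TO, PLATFORM_LINKEDIN)["criteria"])

genres, platforms = 8, 6
print(genres + platforms, genres * platforms)  # 14 48
```

Because both the gate and the writer can load the same two dictionaries, the resolved rubric doubles as the brief the writer aims at.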

What this unlocks

Strip away the folders and adapters, and the transferable pattern is simple: anywhere AI output reaches an audience, put a measurement layer between the generation and the ship. Define good as a small set of named criteria. Turn each into a number. Gate below a threshold. Feed every failure back as a permanent test. Keep the thing that measures separate from the thing that makes.

You don’t need my architecture or the article’s platform to do this. Here’s how it maps onto work a knowledge entrepreneur already does:

Your newsletter. Pick your five best past issues. Write down the four things they all do (one concrete takeaway, a specific example, a clear next step, a non-obvious insight). Score every draft against those four before it sends. The number is your “should this go out” line.
Your AI course assistant or chatbot. This is the silent surface. Write 20 real questions your users ask, with the right answers. Run them every time you change the prompt or model. If the score drops, you caught a regression before a paying student did.
Your sales pages. The criteria are different (one clear CTA, a real proof point, a problem→solution arc) but the loop is identical. Score the page before it goes live.

The point isn’t the tool. It’s that “good enough to ship” stops being a feeling you have at 11pm and becomes a number you can stand behind — and improve on purpose.

Key takeaways

Slop is an output problem. Better prompts tune the input and never check the result. The fix is a gate on the output, not a smarter generator.
Quality has to become a number. “Seems better” misses the bad run hiding in the next fifty. A 0–1 score against named criteria doesn’t.
The dangerous slop is fluent. It reads clean and says nothing, which is exactly why a human skim lets it through. Score substance — “can the reader act on this?” — not surface.
Your AI product output is the silent risk. Bad content embarrasses you publicly; bad product output churns customers without a word. Both need the same gate.
Keep the judge and the writer separate. A critic with edit access grades its own work. Let the gate report what’s broken; you (or your generator) fix it.
Failures should compound. Write every caught failure back as a permanent test, and your standard gets harder to fool every week — without you writing a single new prompt.

How to start

Pull your 5–10 best pieces of one kind — your strongest newsletters, or your best sales emails. These are your “gold standard.” Don’t skip this; the whole thing is fiction without real examples of your good.
Write down 4 criteria they all share, in plain language. (“Has one concrete takeaway.” “Includes a specific example.” “Gives a clear next step.” “Says something non-obvious.”) Add one gate question: would a reader save this?
Score your next draft 0–1 on each, before you publish. Average them. Set your line at 0.7. Below the line doesn’t ship — you rework the lowest-scoring criterion and re-score.
For any AI feature you ship, write 20 real inputs with their correct outputs. Re-run them every time you change the prompt or model. A drop is a regression you caught early.
Every time something bad slips out, add it as a permanent test case. Next time, it’s caught automatically. That’s the loop closing.
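
The scoring step above fits in a spreadsheet, but here it is as a sketch for anyone who prefers code. The criterion keys and the sample scores are hypothetical. I am also assuming that a "no" on the save question blocks the draft outright, which is one reasonable reading of a gate question rather than the only one.

```python
def review_draft(scores: dict[str, float], would_save: bool, line: float = 0.7) -> dict:
    """Average the criterion scores, apply the line, and name what to rework first."""
    average = sum(scores.values()) / len(scores)
    ships = would_save and average >= line
    weakest = min(scores, key=scores.get)  # lowest-scoring criterion
    return {
        "average": round(average, 2),
        "ships": ships,
        "rework_first": None if ships else weakest,
    }


# Hypothetical scores for one newsletter draft.
draft_scores = {
    "concrete_takeaway": 0.8,
    "specific_example": 0.4,
    "clear_next_step": 0.7,
    "non_obvious": 0.7,
}
print(review_draft(draft_scores, would_save=True))
# {'average': 0.65, 'ships': False, 'rework_first': 'specific_example'}
```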

You don’t need to build the folder I built. You need the habit it enforces: nothing reaches your audience without a number on it.

— Lou, walking through a session that turned a thread about “fixing AI slop” into a quality gate I’d actually trust with my brand.

Behind the Article

The moments that taught the most were the two times the gate stopped me: the blocked regression (1.00 → 0.67) and the production dip (0.57) that rewrote the suite. Abstract claims about “measurement” don’t land; a verbatim BLOCKED line in an audit trail does. I led the live section with the slop piece’s 0.16 sidecar for the same reason: the per-criterion reasons (“pure exhortation,” “no steps”) are the most convincing argument in the whole piece that this catches what a skim misses.

I compressed hard on the architecture. The actual session spent real effort on LLM-agnostic adapters (Claude Code / Codex / Gemini), the cascade/registry mechanics, and folder-layer theory — genuinely interesting, but inside-baseball for this audience. I cut it to a single “folder, not platform” framing and spent the saved space on the two things a knowledge entrepreneur can use tomorrow: the score-don’t-rewrite boundary and the genre⊕platform composition. I also reordered: the session built the gate before discussing the no-rewrite boundary, but I moved that decision to its own late section because it’s the single most brand-relevant idea, and it deserved a spotlight, not a footnote mid-build.

The single most valuable thing to add before publishing: a screenshot or short screen-recording of the gate actually blocking the change — the approval prompt appearing in the host, the file moving to rejected/. The piece argues for show-don’t-tell, and right now the proof is JSON blocks. One real frame of the gate halting a ship would make the whole thing visceral instead of described.