A working session where an X post about “fixing AI slop” became a real, running quality gate — one that blocked a bad change, caught a silent production dip, and turned every failure into a permanent test. Here’s the whole build, the verbatim scores, and why a measurement layer is the thing standing between your brand and the slop machine.
I’ve spent a lot of energy on the input side of AI. Better prompts. Bigger models. More context, more memory, more careful instructions. And the output still slips. Not every time — most of the time it’s fine — but every so often a draft comes back that reads clean and says nothing, or an AI feature returns a broken response, and it goes out the door because nobody was checking on the way out.
This session started with an X post by Machina (@EXM7777), “How To Fix AI Slop (Using Hermes).” Its argument stopped me: slop is an output-side problem, not a prompt problem. Every fix I’d been reaching for tunes the gun. None of them check where the bullet lands. The missing piece is a measurement layer — an eval loop — that scores every output against a defined standard and refuses to ship what’s below the line.
The post built that loop on a platform called Hermes. I don’t run Hermes; I run a folder-based “ambient intelligence” architecture. So the task I set myself was: port the eval loop onto my own architecture, then actually run it — content and product, end to end — until I could watch it catch real slop. What follows is that session: what I built, the actual numbers it produced, the two places it blocked me, and the design decision that I think matters most to anyone who publishes content or ships an AI feature.
By the end I had a running gate that scored a genuine how-to at 0.85, scored a piece of fluent fluff at 0.16, halted a code change that regressed, and rewrote its own test suite from a production failure. None of it required a better prompt.
Here’s the core of the article, in my words: prompts, bigger models, and memory are all input-side fixes. They make the next generation a little more likely to be good. But the model is non-deterministic — even a great prompt produces slop on some runs — so a better prompt is, as the post puts it, a slightly better coin flip. What’s missing isn’t a better input. It’s a check on the output: define what good looks like, turn it into a number from 0 to 1, and gate everything below a threshold (default 0.7).
That number is the whole game. “It seems better” doesn’t catch the bad run hiding in the next fifty. A number does.
Concept #1: Measure the output, don’t keep polishing the input. A better prompt is a better coin flip; a quality number is a gate. Until quality is something you can read off a scoreboard, you’re shipping on vibes.
The other thing the article nails is where slop hides. There are two surfaces, and most people only worry about one.
Content output is public. When it’s slop, it embarrasses you loudly — a hollow LinkedIn post, a newsletter that reads fine and teaches nothing. Product output is private. When your AI feature returns slop — wrong format, broken JSON, an answer that drifts off-task — nobody tweets about it. Your users just quietly churn. Same disease (un-measured AI output going straight to an audience), same cure (a gate on the output).
Concept #2: Gate both your content and your product output — the silent one costs more. Bad posts embarrass you; bad product output churns customers without a word. The eval loop is the same for both.
For a knowledge business, that second surface is the one I’d lose sleep over. If you’ve bolted an AI chatbot onto your course, or an AI “draft my email” feature into your offer, that output is your product now — and it has no quality control unless you build one.
The article leans on Hermes for about six platform jobs: channels, cron, approval buttons, a kanban board, memory, and skill-writing. My architecture treats those as the host’s job — the folder declares what it needs and delegates enforcement, rather than rebuilding a platform. So the port came down to one principle: the folder is the agent the moment an LLM runs inside it. I don’t “stand up an agent and connect a channel.” I make a folder with the right pieces in it.
I scaffolded it as a working directory tree — real runnable files, not a diagram:
    eval-loop/
      harness.yaml              manifest + interface contract
      AGENTS.md                 the operating procedure (the six moves)
      SOUL.md                   the gate's stance ("I am the gate, not the generator")
      memory/
        rubric.md               encoded taste: 4 criteria + a meta-criterion
        gold-standard/          ground-truth slot for your 20–50 best pieces
      skills/
        judge/                  LLM-as-judge → 0–1 per criterion + a one-line reason
        suite/                  test cases + which metric scores each
        regression-gate/        re-run cases, compare to baseline, ask for approval
      tools/
        metrics.py              exact-match · regex · json-validator · semantic-similarity
      inbox/ scored/ rejected/  candidate → verdict
      HEARTBEAT.md              cross-run state: the current score line
Two kinds of scoring live in here, and the split is important. The deterministic path is plain code: metrics.py checks things like “is this valid JSON?” or “does this match the expected pattern?” No model needed, so it’s free and instant. The judge path is an LLM reading a rubric and scoring the open-ended stuff a regex can’t catch: is this actionable, is it novel. Before going further I ran the metrics tool to confirm it actually executed: exact-match returned 1.0 on a match and 0.0 on a mismatch, and the JSON validator returned a partial 0.667 on a partly valid batch. Code first, always; the model only fills the holes code can’t.
Concept #3: Split scoring into “code can check this” and “only judgment can.” Deterministic checks (format, JSON, pattern) are free and instant; reserve the expensive LLM judge for the open-ended criteria code can’t touch.
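A minimal sketch of what the deterministic path could look like, using only the Python standard library. The function names and the sample inputs are assumptions for illustration, not the contents of the session's metrics.py, and the semantic-similarity metric is left out because it needs an embedding model.

```python
import json
import re


def exact_match(output: str, expected: str) -> float:
    """1.0 if the output equals the expected string exactly, else 0.0."""
    return 1.0 if output == expected else 0.0


def regex_match(output: str, pattern: str) -> float:
    """1.0 if the whole output matches the pattern, else 0.0."""
    return 1.0 if re.fullmatch(pattern, output) else 0.0


def json_validity(outputs: list[str]) -> float:
    """Fraction of outputs in a batch that parse as JSON (0.0 for an empty batch)."""
    if not outputs:
        return 0.0
    valid = 0
    for text in outputs:
        try:
            json.loads(text)
            valid += 1
        except json.JSONDecodeError:
            pass  # a broken response counts as a failure, not a crash
    return valid / len(outputs)


# Illustrative inputs, not the session's data: two valid objects, one truncated.
batch = ['{"id": 1}', '{"id": 2}', '{"id": ']

print(exact_match("ORDER-1234", "ORDER-1234"))    # 1.0
print(exact_match("Order #1234", "ORDER-1234"))   # 0.0
print(regex_match("ORDER-1234", r"ORDER-\d{4}"))  # 1.0
print(round(json_validity(batch), 3))             # 0.667
```

Every check returns a float between 0 and 1, so the deterministic scores and the judge's scores can feed the same aggregate and the same threshold.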
This is the part I wanted to see. I dropped two content drafts into
inbox/ — one a genuine how-to, one a piece of confident
fluff — and scored each against the rubric’s four criteria (actionable,
accessible, replicable, novel) with a “would anyone bookmark this?”
meta-criterion as the gate.
The slop piece is the interesting one. It reads fine — correct grammar, motivational tone. Here’s the actual score sidecar the judge wrote:
{
"candidate": "candidate-b-slop.md",
"per_criterion": {
"actionable": {"score": 0.05, "reason": "No action the reader can take; pure exhortation ('start your journey today')."},
"accessible": {"score": 0.60, "reason": "Readable, but empty — nothing to actually understand."},
"replicable": {"score": 0.00, "reason": "No steps. Inspirational, not structured."},
"novel": {"score": 0.00, "reason": "Every sentence is a filler cliche; reader learns nothing new."}
},
"meta": {"bookmark_worthy": false, "reason": "Nobody saves this to implement later — caps aggregate at 0.5."},
"aggregate": 0.16,
"threshold": 0.7,
"verdict": "kill",
"flags": ["filler phrases: 'at the end of the day', 'it's worth noting', 'in today's fast-paced landscape', 'the possibilities are endless'"]
}

0.16. Killed. Not because it’s grammatically wrong (it isn’t), but because it’s correct on the outside and empty on the inside, and the gate measures the inside. The real how-to, by contrast, scored 0.85 and shipped: actionable 0.90, replicable 0.90, and a bookmark_worthy: true that the slop piece couldn’t earn.
That’s the whole promise of the article made concrete. Fluent emptiness is exactly the slop that slips past a human skim, because skimming rewards fluency. A number that asks “is there anything here a reader could act on?” doesn’t get fooled by tone.
Concept #4: Score the substance, not the surface — slop is fluent by design. AI slop reads clean; that’s what makes it dangerous. The criteria that catch it are “can the reader act on this?” and “would they save it?” — not grammar.
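To make the arithmetic behind that sidecar concrete, here is a hedged sketch of how per-criterion scores might roll up into an aggregate and a verdict. It assumes equal weights across the four criteria and a "ship" label for passing pieces; both are assumptions, since the write-up shows the scores and the 0.5 cap but not the weighting. Under that assumption the unweighted mean of the slop scores is 0.1625, which is consistent with the reported 0.16.

```python
from statistics import mean

THRESHOLD = 0.7  # default gate line from the article
META_CAP = 0.5   # cap applied when the piece is not bookmark-worthy


def aggregate(per_criterion: dict[str, float], bookmark_worthy: bool) -> float:
    """Unweighted mean of criterion scores, capped if the meta-criterion fails.

    Equal weighting is an assumption; a real rubric may weight criteria.
    """
    score = mean(per_criterion.values())
    return score if bookmark_worthy else min(score, META_CAP)


def verdict(score: float, threshold: float = THRESHOLD) -> str:
    """'ship' at or above the threshold, 'kill' below it ('ship' is a hypothetical label)."""
    return "ship" if score >= threshold else "kill"


# Per-criterion scores copied from the slop sidecar above.
slop = {"actionable": 0.05, "accessible": 0.60, "replicable": 0.00, "novel": 0.00}

score = aggregate(slop, bookmark_worthy=False)
print(f"{score:.4f}", verdict(score))  # 0.1625 kill
```

The cap only bites when the mean is above 0.5: a piece that scores well on every criterion but that nobody would save still cannot clear a 0.7 threshold.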
A reader at this point asked me the right question: does the loop rewrite the content, or just give a go/no-go? Go/no-go, by design — and I’ll come back to why that boundary is the most important decision in the whole build.
Scoring one draft is useful. The bigger payoff is the
regression gate: when you change something — swap a
model, edit a prompt — does quality drop? I set a baseline from a clean
run (pipeline v1 scored 1.00, all three test cases green), recorded it,
then ran a “changed” pipeline where the model started emitting
Order #1234 instead of the expected
ORDER-1234.
The gate caught it. One case dropped from 1.00 to 0.00, the aggregate fell from 1.00 to 0.67, and instead of shipping silently it surfaced an approval prompt. The decision I recorded matters as much as the catch:
| Date | Event | Aggregate | Delta | Decision |
|---|---|---|---|---|
| 2026-06-01 | baseline accepted (pipeline v1) | 1.00 | — | accepted |
| 2026-06-01 | regression-gate (pipeline v2) | 0.67 | -0.33 | BLOCKED — 03-format regressed 1.00→0.00; held for rework |
BLOCKED. The tempting wrong move here is to edit the
test so the new output passes — loosen ORDER-1234 to also
accept Order #1234, watch the number go green, ship. That
games the gate instead of catching the bug. The format change was a
genuine regression in the new pipeline, not a stale test, so the right
response was to send the change back. A gate you quietly relax to make
green isn’t a gate.
Concept #5: When the gate fails, fix the work — never loosen the test to make it green. The moment you edit the test to pass a bad output, you’ve turned your quality control into a rubber stamp.
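The comparison the regression gate performs can be sketched in a few lines. This is an illustration under stated assumptions: the function name, the report fields, and the case names 01-case and 02-case are hypothetical (only 03-format appears in the audit trail), and the approval prompt itself is left to the host.

```python
from statistics import mean


def regression_gate(baseline: dict[str, float], current: dict[str, float],
                    tolerance: float = 0.0) -> dict:
    """Compare per-case scores to a recorded baseline and block on any drop.

    A case missing from the current run is scored 0.0 so it cannot pass silently.
    """
    regressed = {
        case: (baseline[case], current.get(case, 0.0))
        for case in baseline
        if current.get(case, 0.0) < baseline[case] - tolerance
    }
    base_agg = mean(baseline.values())
    curr_agg = mean(current.get(case, 0.0) for case in baseline)
    return {
        "baseline": round(base_agg, 2),
        "current": round(curr_agg, 2),
        "delta": round(curr_agg - base_agg, 2),
        "regressed": regressed,
        "decision": "BLOCKED" if regressed else "accepted",
    }


# Three cases, all green at baseline; the changed pipeline fails the format case.
baseline = {"01-case": 1.0, "02-case": 1.0, "03-format": 1.0}
current = {"01-case": 1.0, "02-case": 1.0, "03-format": 0.0}

report = regression_gate(baseline, current)
print(report["current"], report["delta"], report["decision"])  # 0.67 -0.33 BLOCKED
print(report["regressed"])  # {'03-format': (1.0, 0.0)}
```

Note what the function does not take as input: the expected values. The baseline and the test cases stay fixed, so the only way to turn the decision back to accepted is to change the pipeline's output.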
The last move is production-watch: a recurring job that samples real production output, scores it with the same judge, and alerts you when the line dips. I ran a sample of 8 production executions — aggregate 0.57, below the 0.7 line, dip alert fired. It flagged the bad runs.
Then the part that makes this compound. The two real failures — a broken-JSON response and a wrong-format order — got written back as permanent test cases. The format bug that I’d already caught once at the ship gate now also failed in two new production-derived cases. Same bug, three places it can never silently pass again.
That’s the quiet superpower of the whole approach. Every failure makes the test suite harder. You don’t get smarter by writing better prompts; you get safer because the floor rises every time something slips. A month of running it and your gate has seen — and permanently encoded — every way your output has failed.
Concept #6: Turn every failure into a permanent test — that’s how the floor rises on its own. A bug you fix once can come back. A bug you encode as a test case is caught forever. The suite compounds; the prompt doesn’t.
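Here is a rough sketch of the production-watch move: score a sample, raise a dip alert when the aggregate falls under the line, and turn each failing run into a new test case. The Run type, the fail_below cutoff, and the test-case fields are all hypothetical, and the four sample runs are made up for illustration rather than taken from the 8-execution sample.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Run:
    run_id: str
    output: str
    score: float  # 0-1, from the same scoring path the ship gate uses


def production_watch(runs: list[Run], threshold: float = 0.7,
                     fail_below: float = 0.5) -> dict:
    """Score a non-empty production sample, flag a dip, and promote failures to tests."""
    agg = mean(r.score for r in runs)
    failures = [r for r in runs if r.score < fail_below]
    # Each failing output becomes a permanent case the suite must keep rejecting.
    new_cases = [
        {"id": f"prod-{r.run_id}", "bad_output": r.output, "must_not_pass": True}
        for r in failures
    ]
    return {"aggregate": round(agg, 2), "dip_alert": agg < threshold,
            "new_cases": new_cases}


# Illustrative sample: two clean runs, one broken-JSON run, one wrong-format run.
sample = [
    Run("r1", '{"order": "ORDER-1234"}', 1.0),
    Run("r2", '{"order": "ORDER-5678"}', 1.0),
    Run("r3", '{"order": ', 0.0),
    Run("r4", '{"order": "Order #1234"}', 0.2),
]

result = production_watch(sample)
print(result["aggregate"], result["dip_alert"])  # 0.55 True
print([case["id"] for case in result["new_cases"]])  # ['prod-r3', 'prod-r4']
```

The returned cases would be appended to the suite, so the next regression run replays them alongside the original ones.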
Back to that reader’s question. The gate I built scores and routes; it never edits the work. That’s a hard boundary, written into the gate’s own stance file:
I am the gate, not the generator. I never edit the work to rescue a score. I report; the generator reworks. My value is that I have no ego in the piece.
This looks like a limitation. It’s the opposite. Three reasons it’s deliberate:
A judge that also rewrites can’t be trusted. The moment the gate edits the work, it’s grading its own output — it has ego in the piece. The judge’s entire value is that it has no attachment to the one sentence you’re secretly proud of.
Generation and measurement are different jobs. You already have a generator — your prompt, your agent, your team. What was missing was the measurement. Folding rewriting back into the gate just recreates the input-side confusion the whole approach is trying to escape.
A number is debuggable; a silent rewrite isn’t. “0.81 → 0.74, novel scored 0.2” tells you exactly what broke. A draft that got quietly rewritten hides which run was bad and why.
So what comes back is an aggregate score, per-criterion scores
with one-line reasons, a verdict, and the file moved to
scored/ or rejected/. The reasons tell you
what’s broken. The fix is yours. If you do want automatic
rewriting, it lives in a separate generator that reads the gate’s
reasons and reworks — the gate stays clean, the loop stays honest.
Concept #7: Keep the judge and the writer separate — a critic with edit access stops being a critic. Let the gate tell you what’s broken and why; let you or your generator do the fixing. Mixing the roles destroys the measurement.
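The score-and-route boundary can be shown in code: the gate moves the file and writes a sidecar next to it, and at no point opens the draft for writing. This is a sketch under assumptions. The route function, the .score.json sidecar name, and the "ship" verdict label are hypothetical; only the scored/ and rejected/ folders and the "kill" verdict come from the build above.

```python
import json
import shutil
import tempfile
from pathlib import Path


def route(candidate: Path, sidecar: dict, root: Path) -> Path:
    """Move the candidate, unmodified, to scored/ or rejected/ and write its sidecar."""
    dest_dir = root / ("scored" if sidecar["verdict"] == "ship" else "rejected")
    dest_dir.mkdir(parents=True, exist_ok=True)
    dest = dest_dir / candidate.name
    shutil.move(str(candidate), str(dest))
    sidecar_path = dest_dir / (candidate.stem + ".score.json")
    sidecar_path.write_text(json.dumps(sidecar, indent=2), encoding="utf-8")
    return dest


# Demo in a throwaway directory so nothing real is touched.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "inbox").mkdir()
    draft = root / "inbox" / "candidate-b-slop.md"
    original = "Start your journey today."
    draft.write_text(original, encoding="utf-8")

    dest = route(draft, {"aggregate": 0.16, "verdict": "kill"}, root)
    moved_to = dest.parent.name
    body_unchanged = dest.read_text(encoding="utf-8") == original
    sidecar_written = (dest.parent / "candidate-b-slop.score.json").exists()

print(moved_to, body_unchanged, sidecar_written)  # rejected True True
```

A separate generator could read the sidecar's reasons and produce a new draft into inbox/, which keeps the rewrite outside the gate.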
Late in the session I asked the question a content business actually cares about: I don’t write one kind of thing. I write sales copy, thought-leadership, how-tos, newsletters, listicles — and I ship them to LinkedIn, X, email. Do I need a separate rubric for every combination?
If you build one file per combination, you get 8 genres × 6 platforms = 48 monoliths that drift apart. The fix is to notice that genre and platform are orthogonal. Sales copy on LinkedIn and sales copy on X share most of their standard — only the constraints differ (length, hook convention, CTA rules). So you compose instead of multiply:
resolved = genre_base ⊕ platform_overlay. The genre
carries the substance (criteria, weights, threshold, exemplars). The
platform is a thin delta (max length, format rules, one or two extra
criteria). 8 + 6 = 14 small files, not 48. I built it
and ran a live composition — a how-to written as an X thread, scored as
how-to ⊕ platform:x. The four how-to criteria came from the
genre; first_post_stands_alone and
thread_coherence were injected by the X overlay; the
threshold was the genre’s plus the overlay’s delta. Swap in
platform:linkedin and you’d get
hook_survives_fold instead, same genre core.
Here’s the part that closes the loop for a content business. These rubrics don’t live inside the gate — they live in a shared spot both the gate and the writer can reach. Because if the standard for “good sales copy” lives only inside the evaluator, your writing process can’t aim at it. Promote it so both share the same named standard, and the writer writes toward the exact spec the gate grades against. One contract, two consumers. That’s the difference between a critic that surprises you after the fact and a standard everyone was building toward from the start.
Concept #8: Compose your standards — write the genre once, layer the platform on top. Genre and platform are orthogonal; multiplying them is the trap. And keep the standard shared so your writing aims at the same target your gate grades against.
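A small sketch of the composition, assuming each rubric is a plain dictionary. The field names (extra_criteria, threshold_delta, constraints), the equal weights, and the 0.05 delta are hypothetical placeholders; the criterion names are the ones from the live composition described above.

```python
def compose(genre: dict, overlay: dict) -> dict:
    """resolved = genre_base ⊕ platform_overlay.

    The genre supplies criteria and threshold; the overlay adds criteria,
    nudges the threshold, and contributes platform constraints.
    """
    return {
        "criteria": {**genre["criteria"], **overlay.get("extra_criteria", {})},
        "threshold": round(genre["threshold"] + overlay.get("threshold_delta", 0.0), 2),
        "constraints": dict(overlay.get("constraints", {})),
    }


# Genre base: the four how-to criteria, with placeholder equal weights.
HOW_TO = {
    "criteria": {"actionable": 1.0, "accessible": 1.0, "replicable": 1.0, "novel": 1.0},
    "threshold": 0.7,
}

# Platform overlays: thin deltas only (values are illustrative).
PLATFORM_X = {
    "extra_criteria": {"first_post_stands_alone": 1.0, "thread_coherence": 1.0},
    "threshold_delta": 0.05,
    "constraints": {"format": "thread"},
}
PLATFORM_LINKEDIN = {
    "extra_criteria": {"hook_survives_fold": 1.0},
    "constraints": {"format": "single post"},
}

resolved_x = compose(HOW_TO, PLATFORM_X)
resolved_li = compose(HOW_TO, PLATFORM_LINKEDIN)

print(list(resolved_x["criteria"]))
print(resolved_x["threshold"], resolved_li["threshold"])  # 0.75 0.7
```

Because compose only reads its inputs, the same genre file can be paired with any overlay, and both the writer and the gate can call it to resolve the identical standard.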
Strip away the folders and adapters, and the transferable pattern is simple: anywhere AI output reaches an audience, put a measurement layer between the generation and the ship. Define good as a small set of named criteria. Turn each into a number. Gate below a threshold. Feed every failure back as a permanent test. Keep the thing that measures separate from the thing that makes.
You don’t need my architecture or the article’s platform to do this. Here’s how it maps onto work a knowledge entrepreneur already does:
The point isn’t the tool. It’s that “good enough to ship” stops being a feeling you have at 11pm and becomes a number you can stand behind — and improve on purpose.
You don’t need to build the folder I built. You need the habit it enforces: nothing reaches your audience without a number on it.
— Lou, walking through a session that turned a thread about “fixing AI slop” into a quality gate I’d actually trust with my brand.
The moments that taught the most were the two times the gate stopped me: the blocked regression (1.00 → 0.67) and the production dip (0.57) that rewrote the suite. Abstract claims about “measurement” don’t land; a verbatim BLOCKED line in an audit trail does. I led the live section with the slop piece’s 0.16 sidecar for the same reason: the per-criterion reasons (“pure exhortation,” “no steps”) are the most convincing argument in the whole piece that this catches what a skim misses.
I compressed hard on the architecture. The actual session spent real effort on LLM-agnostic adapters (Claude Code / Codex / Gemini), the cascade/registry mechanics, and folder-layer theory — genuinely interesting, but inside-baseball for this audience. I cut it to a single “folder, not platform” framing and spent the saved space on the two things a knowledge entrepreneur can use tomorrow: the score-don’t-rewrite boundary and the genre⊕platform composition. I also reordered: the session built the gate before discussing the no-rewrite boundary, but I moved that decision to its own late section because it’s the single most brand-relevant idea, and it deserved a spotlight, not a footnote mid-build.
The single most valuable thing to add before publishing: a
screenshot or short screen-recording of the gate actually blocking the
change — the approval prompt appearing in the host, the file
moving to rejected/. The piece argues for show-don’t-tell,
and right now the proof is JSON blocks. One real frame of the gate
halting a ship would make the whole thing visceral instead of
described.