Avoiding AI Slop: Why Quality Matters More than Ever


Introduction

“AI slop” is what happens when an agent ships code that technically satisfies the prompt and practically poisons the codebase: dead branches, duplicated helpers, swallowed errors, ornamental tests, comments that narrate the obvious, abstractions invented for problems that do not exist. Agents can also be lazy: forcing tests to pass, sprinkling in ignores, and taking shortcuts they promise to address later.

This is not a model problem. It is a process problem. An agent given a sloppy codebase will produce sloppier code, because its context window is full of bad examples to imitate. Slop is self-reinforcing, and the only durable answer is solid, fundamental engineering discipline, encoded and automated.

This post catalogues the guardrails I rely on to keep agentic output clean, and explains why each one earns its place.

Key Concepts

Why slop is expensive

As models and dev tooling continue to improve, it is tempting to treat AI slop as a cosmetic problem, especially now that working software arrives so fast. For production systems, however, the costs are concrete:

  • Quality. Slop hides bugs. A test that asserts expect(result).toBeTruthy() looks green and tells you nothing (a sketch follows this list). A try/catch that logs and continues turns a hard failure into a silent corruption.
  • Maintainability. Every duplicated utility, premature abstraction, or “just in case” parameter is a tax on every future change. Humans pay it in time; agents pay it in tokens and confusion.
  • Extensibility. A codebase with consistent patterns can be extended by composition. A codebase of parallel half-implementations forces every new feature to be a special case.
  • Agent efficiency. This is the modern multiplier. Agents work by pattern-matching against the existing repo. A clean codebase yields short, focused diffs; a slop-ridden one yields longer runs, more retries, more friction comments, and higher cost per ticket. Your token bill is a direct function of your code quality.
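
To make the first bullet concrete, here is a minimal sketch (the applyDiscount function and the Vitest runner are assumptions for illustration): the first test executes the code and inflates coverage, while the second actually pins the behaviour.

```typescript
import { describe, expect, it } from "vitest";

// Hypothetical function under test.
function applyDiscount(price: number, percent: number): number {
  return price - (price * percent) / 100;
}

describe("applyDiscount", () => {
  // Decorative: the line is executed, coverage goes up, nothing is verified.
  it("returns something", () => {
    expect(applyDiscount(100, 10)).toBeTruthy(); // 90 passes, but so would 1, -5, or Infinity
  });

  // Meaningful: pins the exact behaviour, including the boundary case.
  it("subtracts the percentage and leaves a zero discount untouched", () => {
    expect(applyDiscount(100, 10)).toBe(90);
    expect(applyDiscount(100, 0)).toBe(100);
  });
});
```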

The guardrails

The best approach is layered. Each layer catches a class of slop the others miss, and the value comes from running all of them as a continuous pipeline rather than picking favourites.

1. Encoded guardrails (personas, skills, rules)

The first defence is making expectations machine-readable. I have written about this before, but this is why I built and use agent-protocols — a versioned set of personas, modular global rules (coding style, API conventions, security baseline, testing standards, etc.), and a two-tier skill library that hydrates every agent run with the constraints relevant to the task at hand. The agent does not have to remember the project’s conventions; the conventions are loaded into context every single run.

This is what shifts agents from “freelancer who has never seen the codebase” to “knowledgeable and disciplined engineer”.
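
agent-protocols has its own structure, but purely as an illustration of the hydration idea (the file names and layout below are hypothetical), the mechanism amounts to selecting the rule modules relevant to the task and prepending them to the context of every run:

```typescript
import { readFile } from "node:fs/promises";
import path from "node:path";

// Hypothetical layout: one markdown file per rule module.
const RULES_DIR = "protocols/rules";

// Load the always-on rules plus whatever the task tags call for,
// so the conventions arrive in context on every single run.
async function hydrateContext(taskTags: string[]): Promise<string> {
  const modules = ["coding-style", "testing-standards", "security-baseline", ...taskTags];
  const sections = await Promise.all(
    modules.map((name) => readFile(path.join(RULES_DIR, `${name}.md`), "utf8")),
  );
  return sections.join("\n\n---\n\n");
}
```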

2. SDLC fundamentals with automated pipelines

Discipline is process, and process is automation. CI must run on every push; failures must block merge; main must always be releasable. None of this is new. It is the same advice any senior engineer would give in 2015, but it becomes load-bearing in the agentic era because agents will cheerfully push broken code if the pipeline lets them. The pipeline is now your primary code reviewer for the dozens of routine PRs your agents are opening every day.

3. Linting and formatting

The cheapest, fastest signal in the toolbox. Biome, ESLint, Prettier, etc. will catch the trivial slop (unused imports, dead variables, inconsistent style). They also normalise the codebase, which is what gives later layers a stable surface to operate on.
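
As a sketch of what that looks like in practice (assuming an ESLint flat config with typescript-eslint; Biome users would express the same intent in biome.json), the point is to make the trivial slop a hard error rather than a warning to scroll past:

```javascript
// eslint.config.mjs: a minimal sketch, not a full ruleset
import js from "@eslint/js";
import tseslint from "typescript-eslint";

export default tseslint.config(js.configs.recommended, ...tseslint.configs.recommended, {
  rules: {
    // Unused imports and dead variables fail the build instead of lingering.
    "no-unused-vars": "off",
    "@typescript-eslint/no-unused-vars": ["error", { argsIgnorePattern: "^_" }],
    "no-console": "error",
  },
});
```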

4. A healthy test pyramid

A pyramid, not an inverted triangle. Most projects have the right intent (lots of unit tests, a few integration tests, a thin layer of E2E) and the wrong reality (a smattering of unit tests stapled to a wall of brittle Cypress flows). The agent-protocols testing rule splits tests into three explicit tiers: unit (pure logic, mocks at boundaries), contract (API ↔ DB shape and status assertions against a real DB), and acceptance (Gherkin scenarios over real stacks). Each tier has rules about what it may and may not assert, which prevents the most common form of test slop: status-code assertions buried in feature files, or DB checks duct-taped onto unit tests.
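
As a sketch of the contract tier (assuming an Express app exercised through supertest against a real test database; the route and response shape are hypothetical), status codes and shape are asserted here and nowhere else:

```typescript
import request from "supertest";
import { describe, expect, it } from "vitest";
import { app } from "../src/app"; // hypothetical app wired to a real test database

describe("GET /api/users/:id (contract)", () => {
  it("returns 200 and the agreed response shape", async () => {
    const res = await request(app).get("/api/users/1");

    // Status and shape live in the contract tier, not in unit tests or feature files.
    expect(res.status).toBe(200);
    expect(res.body).toMatchObject({ id: expect.any(Number), email: expect.any(String) });
  });
});
```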

5. Enforced CRAP (coverage × complexity)

Coverage on its own is gameable; complexity on its own ignores risk. The CRAP score combines them, flagging the riskiest pattern: high cyclomatic complexity in an under-tested function. In my projects the CRAP baseline is enforced as a gate: new methods over a configured ceiling fail CI, and the baseline ratchets downward with each refactor. Slop tends to manifest as deeply nested, lightly tested functions; CRAP makes that visible the moment it lands.
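
For reference, the usual formulation (Savoia and Evans) multiplies squared complexity by the uncovered fraction cubed, then adds complexity back, so well-tested complex code is penalised far less than untested complex code. A small sketch:

```typescript
// CRAP(m) = comp(m)^2 * (1 - cov(m))^3 + comp(m)
// comp: cyclomatic complexity of the method, cov: fraction of it covered by tests (0..1)
function crapScore(complexity: number, coverage: number): number {
  return complexity ** 2 * (1 - coverage) ** 3 + complexity;
}

crapScore(10, 1.0); // 10: fully covered, only raw complexity remains
crapScore(10, 0.0); // 110: untested and complex, well over any sensible ceiling
```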

6. Enforced maintainability baseline

Alongside CRAP, a maintainability index baseline (Halstead volume, cyclomatic complexity, lines of code) gives a holistic per-file score. Agents are particularly prone to inflating maintainability cost, for example by adding a fourth optional parameter, copy-pasting a near-duplicate helper, or writing “just in case” branches. A ratcheting maintainability baseline stops those tendencies from accumulating.
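
For completeness, the classic formulation looks like the sketch below; tools differ in coefficients and normalisation, so treat it as an illustration rather than the exact metric any given tool reports:

```typescript
// MI = 171 - 5.2*ln(HalsteadVolume) - 0.23*CyclomaticComplexity - 16.2*ln(LinesOfCode)
function maintainabilityIndex(halsteadVolume: number, complexity: number, loc: number): number {
  const raw = 171 - 5.2 * Math.log(halsteadVolume) - 0.23 * complexity - 16.2 * Math.log(loc);
  // Many tools rescale to 0..100, which makes a ratcheting baseline easier to reason about.
  return Math.max(0, (raw * 100) / 171);
}
```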

7. Regular audits

Linting catches the trivial. Tests catch the functional. Audits catch the architectural. The agent-protocols project ships a comprehensive suite of single-command audit workflows (e.g. /audit-architecture, /audit-security, /audit-performance) that I run regularly.

Audits work best when they are routine, which is why I have started to enforce them as gates inside the sprint pipeline. There is more work to do here; I have also experimented with using multiple models to check each other’s work, which has been promising.

8. Pre-commit and pre-push hooks

Hooks are the local feedback loop and the cheapest possible enforcement point. Pre-commit stays light: fast feedback on the diff. Pre-push is the heavyweight gate that runs the same suite of checks as CI, so the laptop fails first instead of the build queue.

Pre-commit typically does just one thing well: run lint and format against the staged files via lint-staged. Fast, scoped to the diff, and impossible to skip accidentally.
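
A minimal setup along those lines might look like this (a sketch assuming ESLint and Prettier via lint-staged; substitute Biome if that is your formatter):

```javascript
// lint-staged.config.mjs: runs only against the staged files
export default {
  "*.{ts,tsx}": ["eslint --fix --max-warnings=0", "prettier --write"],
  "*.{json,md,yml}": ["prettier --write"],
};
```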

Pre-push is where the real quality contract lives. A robust setup runs the full suite before a single byte hits the remote:

  • Validate — lint, typecheck, and the unit + contract test suites.
  • Dependency audit at a high-severity threshold (pnpm audit --audit-level=high or the npm equivalent).
  • Lint baseline ratchet — no new warnings allowed vs the committed baselines/lint.json.
  • Maintainability index check — the per-file score must not regress against baseline.
  • Coverage capture scoped to the diff range, so the CRAP check has fresh data.
  • CRAP check against the same diff range — any changed function over the configured ceiling fails the push.

Read top to bottom, that is the entire quality contract enforced locally. The agent cannot push slop because the laptop refuses to push it.
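
Wired into a Husky pre-push hook, the contract might look something like this (a sketch; the script names are hypothetical and should map onto whatever your repo already defines):

```sh
#!/bin/sh
# .husky/pre-push: the same checks CI runs, so the laptop fails before the build queue does
pnpm run validate                # lint, typecheck, unit + contract suites
pnpm audit --audit-level=high    # dependency audit at the high-severity threshold
pnpm run lint:baseline           # no new warnings vs baselines/lint.json
pnpm run maintainability:check   # per-file score must not regress
pnpm run coverage:diff           # capture coverage scoped to the diff range
pnpm run crap:check              # changed functions over the ceiling fail the push
```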

The point is not to replace CI — it’s to fail fast, locally.

9. Mutation testing

The final and most honest layer. Coverage tells you which lines were executed; mutation testing tells you which lines were asserted. A tool like Stryker mutates production code (flips a > to >=, deletes a return) and re-runs the unit suite — any mutant that survives is a line your tests do not actually verify. This is the single most effective defence against the most insidious form of AI slop: tests that exist only to push the coverage number up. If a mutant survives, the test is decorative; mutation testing makes that undeniable.

Mutation testing is expensive to run on every push, so running it on a schedule such as nightly is fine. Keep an eye on the trend and make continuous improvements.
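
A starting point might be a StrykerJS config along these lines (a sketch assuming a Vitest runner; the thresholds are illustrative, and the break value is the knob that keeps the trend honest):

```javascript
// stryker.config.mjs: run on a schedule (e.g. nightly) rather than on every push
/** @type {import('@stryker-mutator/api/core').PartialStrykerOptions} */
export default {
  mutate: ["src/**/*.ts", "!src/**/*.test.ts"],
  testRunner: "vitest",
  reporters: ["html", "clear-text", "progress"],
  thresholds: { high: 80, low: 65, break: 60 }, // fail the run if the mutation score drops below 60
};
```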

Conclusion

None of the mechanisms above are novel. Linting is older than I am. The test pyramid is twenty years old. CRAP, maintainability indices, and mutation testing are all well-trodden ground. What changes in the agentic era is the cost of skipping them. A 3-engineer team writing 50 PRs a week could absorb a little slop. A 3-engineer team running 5 agents writing 500 PRs a week cannot. The slop compounds faster than humans can reason about it, and at some point could become intractable.

The good news is that the mechanisms to prevent it are well-understood and largely off-the-shelf. If your agentic output feels sloppy, the answer is rarely a better prompt. It is a better pipeline.
