Teardown

Why most AI automation projects break in production

May 14, 2026 ai-honest 9 min read

Most AI automation projects pass the demo, ship to production, and quietly break three to six weeks later. Not loudly - there's no exception in the log, no Slack alert at 3am. The workflow keeps running. The outputs keep flowing. Nobody notices for a while.

Then someone catches a duplicate invoice. Or a lead that was qualified as "hot" turns out to be a cold inquiry from 2023. Or the daily summary email subtly stops including one of the data sources. And the team learns, painfully, that the "automation" has been silently degrading for weeks.

I've cleaned up enough of these to recognize the pattern. Five failure modes show up in roughly every other engagement.

1. The prompt is unversioned

This is the most common one. The team built the workflow in a no-code tool or a Python notebook. The prompt is a multiline string somewhere. It's been edited fourteen times since launch. Nobody knows which version is currently in production. Nobody knows what was different about the version that was working two weeks ago.

The fix is unglamorous: treat prompts like code. Put them in your repo. Version them. Tag them with the model version they were tested against. Test outputs against a small but real dataset on every change. The companies that get this right have a prompts directory that looks roughly like a small set of unit tests.
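A minimal sketch of what "prompts as code" could look like, in Python. Everything named here is hypothetical: the PromptVersion class, the lead_qualifier prompt, the example-model-2026-01 tag, and the fake_model stand-in that replaces a real provider call. The idea is only that each prompt carries the model version it was tested against, gets a content fingerprint you can log with every run, and is checked against a small set of real cases before deploy.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class PromptVersion:
    name: str          # hypothetical prompt identifier
    text: str          # the prompt template, stored in the repo
    tested_model: str  # model version this prompt was evaluated against

    @property
    def fingerprint(self) -> str:
        # Short content hash: any edit to the text or the model tag changes it,
        # so logs can record exactly which prompt produced an output.
        payload = f"{self.tested_model}\n{self.text}".encode("utf-8")
        return hashlib.sha256(payload).hexdigest()[:12]


def check_cases(prompt, cases, call_model):
    """Return the inputs whose output does not match the expected label."""
    failures = []
    for case_input, expected in cases:
        rendered = prompt.text.format(input=case_input)
        output = call_model(rendered, prompt.tested_model)
        if output.strip().lower() != expected:
            failures.append(case_input)
    return failures


# Hypothetical prompt and cases, for illustration only.
LEAD_QUALIFIER = PromptVersion(
    name="lead_qualifier",
    text="Classify this lead as hot or cold.\nLead: {input}\nAnswer:",
    tested_model="example-model-2026-01",
)

CASES = [
    ("budget approved, wants a call this week", "hot"),
    ("general question from an old thread", "cold"),
]


def fake_model(rendered_prompt, model):
    # Stand-in for a real provider call; swap in your own client here.
    return "hot" if "budget approved" in rendered_prompt else "cold"


failures = check_cases(LEAD_QUALIFIER, CASES, fake_model)
print(LEAD_QUALIFIER.name, LEAD_QUALIFIER.fingerprint, "failures:", failures)
```

Run on every change, a non-empty failures list blocks the deploy, and the fingerprint answers the "which version is in production" question.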

2. The model changed underneath you

OpenAI ships a new model version. Anthropic deprecates an old one. Even within the same nominal model, weights get tuned. The provider's API didn't change. Your code didn't change. The workflow's output drifted anyway.

If your automation depends on the model returning JSON in a specific shape, or a specific kind of summary, or a specific classification taxonomy - you need monitoring on the output, not just on the API call status. A quick eval suite that runs against a frozen set of inputs once a day, comparing output deltas, catches this before customers do. It is approximately ten lines of code. Most teams don't write it.
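Here is a rough sketch of that daily check, assuming the workflow expects a JSON object with a category field. The field name, the frozen inputs, and both stand-in model functions are assumptions for illustration. The core loop is short; most of the block is demo scaffolding. It reduces each response to the parts downstream code depends on and reports any frozen input whose summary no longer matches the stored baseline.

```python
import json


def summarize(raw):
    """Reduce a raw model response to the parts downstream code relies on."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return {"error": "invalid_json"}
    if not isinstance(data, dict):
        return {"error": "not_an_object"}
    return {"keys": sorted(data), "category": data.get("category")}


def find_drift(frozen_inputs, baseline, call_model):
    """Return {input_id: current_summary} for every input that changed."""
    drifted = {}
    for input_id, text in frozen_inputs.items():
        current = summarize(call_model(text))
        if current != baseline.get(input_id):
            drifted[input_id] = current
    return drifted


# Hypothetical frozen inputs; in practice these are real, saved examples.
FROZEN_INPUTS = {
    "refund-1": "I want my money back for order 1182.",
    "sales-1": "Do you offer annual plans?",
}


def stable_model(text):
    # Stand-in for the provider call on the day the baseline was recorded.
    category = "refund" if "money back" in text else "sales"
    return json.dumps({"category": category, "summary": text[:20]})


def drifted_model(text):
    # Stand-in for the same call after a silent model update:
    # refund requests now come back under a different key.
    if "money back" in text:
        return json.dumps({"type": "refund", "summary": text[:20]})
    return stable_model(text)


BASELINE = {k: summarize(stable_model(v)) for k, v in FROZEN_INPUTS.items()}

print(find_drift(FROZEN_INPUTS, BASELINE, stable_model))
print(find_drift(FROZEN_INPUTS, BASELINE, drifted_model))
```

Comparing a summary rather than the raw text is a deliberate choice in this sketch: free-text fields vary from run to run, while the key set and the classification are what the rest of the workflow breaks on.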

3. There's no human in the loop where there should be

The demo showed it working end-to-end without human intervention. The team got excited and shipped it without one. Now an LLM is making decisions that occasionally affect money, customer perception, or legal exposure.

For a class of decisions - qualification ambiguous, confidence below a threshold, anything touching billing or contracts - a human review step isn't a step backwards. It's the difference between a workflow you can defend to your team and a workflow you can't.

I'll write more about exactly where the human step belongs in a separate piece. The short version: any decision that's reversible and high-volume, automate freely. Anything irreversible or low-volume, keep a human in the loop. Anything in between, route by confidence score.
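One way to read that rule as code, as a sketch rather than a policy. The three volume tiers, the 0.85 threshold, and the Decision fields are all assumptions made for the example; "in between" is modeled here as a reversible, medium-volume decision.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    AUTOMATE = "automate"
    HUMAN_REVIEW = "human_review"


class Volume(Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"


@dataclass(frozen=True)
class Decision:
    reversible: bool
    volume: Volume
    confidence: float  # 0.0 to 1.0, however your pipeline derives it
    touches_billing_or_contracts: bool = False


CONFIDENCE_THRESHOLD = 0.85  # assumed value; tune it against reviewed samples


def route(decision, threshold=CONFIDENCE_THRESHOLD):
    # Irreversible, or touching billing or contracts: always a human.
    if decision.touches_billing_or_contracts or not decision.reversible:
        return Route.HUMAN_REVIEW
    # Low-volume: a human can keep up, so keep one in the loop.
    if decision.volume is Volume.LOW:
        return Route.HUMAN_REVIEW
    # Reversible and high-volume: automate freely.
    if decision.volume is Volume.HIGH:
        return Route.AUTOMATE
    # Everything in between is routed by confidence score.
    if decision.confidence >= threshold:
        return Route.AUTOMATE
    return Route.HUMAN_REVIEW


print(route(Decision(reversible=True, volume=Volume.HIGH, confidence=0.6)))
print(route(Decision(reversible=True, volume=Volume.MEDIUM, confidence=0.6)))
print(route(Decision(reversible=False, volume=Volume.HIGH, confidence=0.99)))
```

Note that a high confidence score never overrides the irreversible check. That ordering is the point of the sketch.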

4. Failure modes are invisible

The API call returned 200. The JSON parsed. The workflow continued. Nothing was wrong from your monitoring's perspective.

But the model returned an empty string for a field that was supposed to contain the customer name. Or it hallucinated a SKU that doesn't exist. Or it categorized a refund request as a sales inquiry.

Observability for AI workflows is fundamentally different from observability for deterministic systems. You need to instrument the content of the output, not just the success of the call. Anomaly detection on output distribution. Sampling 1% of outputs for human review. Alerting when the rate of empty fields, low-confidence classifications, or specific failure tokens exceeds a baseline.

This is not exotic. It is also rarely shipped.
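To make the content checks described above concrete, here is a small sketch. The field names, the baseline rates, the 2% tolerance, and the known-SKU set are all assumptions for illustration. It measures the rate of empty required fields and unknown SKUs over a batch of parsed outputs, flags anything above its baseline, and pulls a random sample for human review.

```python
import random

REQUIRED_FIELDS = ("customer_name", "sku", "category")  # hypothetical fields


def empty_field_rates(outputs, fields=REQUIRED_FIELDS):
    """Fraction of outputs in which each required field is missing or blank."""
    total = len(outputs)
    if total == 0:
        return {field: 0.0 for field in fields}
    rates = {}
    for field in fields:
        empty = sum(1 for o in outputs if not str(o.get(field) or "").strip())
        rates[field] = empty / total
    return rates


def unknown_sku_rate(outputs, known_skus):
    """Fraction of outputs whose SKU is not in the catalog (likely invented)."""
    if not outputs:
        return 0.0
    unknown = sum(1 for o in outputs if o.get("sku") not in known_skus)
    return unknown / len(outputs)


def over_baseline(rates, baseline, tolerance=0.02):
    """Names of metrics whose rate exceeds baseline plus tolerance."""
    return sorted(
        name for name, rate in rates.items()
        if rate > baseline.get(name, 0.0) + tolerance
    )


def sample_for_review(outputs, fraction=0.01, seed=None):
    """Pick roughly `fraction` of outputs at random for a human to read."""
    rng = random.Random(seed)
    return [o for o in outputs if rng.random() < fraction]


# Hypothetical batch: 100 outputs, 10 with a blank name, 5 with an invented SKU.
KNOWN_SKUS = {"A-100", "B-200"}
batch = [
    {
        "customer_name": "" if i < 10 else f"Customer {i}",
        "sku": "Z-999" if i >= 95 else "A-100",
        "category": "refund",
    }
    for i in range(100)
]

rates = empty_field_rates(batch)
rates["unknown_sku"] = unknown_sku_rate(batch, KNOWN_SKUS)
baseline = {"customer_name": 0.01, "sku": 0.0, "category": 0.0, "unknown_sku": 0.0}
print(over_baseline(rates, baseline))
```

Every call in this batch would have returned 200 with valid JSON. Only the content-level rates show that something is off.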

5. The workflow runs on platform glue with no real backend

The team built it in Make, n8n, or Zapier, with a thin layer of glue between APIs. It worked beautifully in development. It works mostly in production. But there's no queue, no retry semantics, no dead-letter pattern, no idempotency, no transactional boundary.

When an integration upstream changes timing - and they always change timing - the workflow either silently double-processes or silently drops events. The platform's "retry on failure" is not the same thing as a queue with proper at-least-once semantics.

Some workflows can live on platform glue forever. Some absolutely cannot. The ones that touch billing, identity, fulfillment, or anything regulated belong on a real backend with real queue infrastructure. Laravel + Horizon is one option. There are others. The point is that "no-code is faster" stops being true the third time you have to debug a silent event drop.
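As an illustration of what the platform glue is missing, here is an in-memory sketch of idempotent handling with bounded retries and a dead-letter list. It is not a queue: the class name, the event shape, and the create_invoice handler are hypothetical, and a real backend would keep the processed IDs and dead letters in durable storage and mark an event processed in the same transaction as its side effect.

```python
class IdempotentProcessor:
    """Process each event ID at most once; park repeated failures."""

    def __init__(self, handler, max_attempts=3):
        self.handler = handler
        self.max_attempts = max_attempts
        self.processed_ids = set()  # in production: a durable, unique-keyed store
        self.dead_letters = []      # in production: a dead-letter queue or table

    def handle(self, event):
        event_id = event["id"]
        # At-least-once delivery means duplicates will arrive; skip them.
        if event_id in self.processed_ids:
            return "duplicate"
        last_error = None
        for _attempt in range(self.max_attempts):
            try:
                self.handler(event)
            except Exception as exc:  # broad on purpose: any failure is retried
                last_error = exc
                continue
            self.processed_ids.add(event_id)
            return "processed"
        # Retries exhausted: keep the event and the error instead of dropping it.
        self.dead_letters.append({"event": event, "error": repr(last_error)})
        return "dead_lettered"


invoices = []


def create_invoice(event):
    # Hypothetical side effect; fails on malformed events.
    if event.get("amount") is None:
        raise ValueError("missing amount")
    invoices.append(event["id"])


processor = IdempotentProcessor(create_invoice)
results = [
    processor.handle(event)
    for event in [
        {"id": "evt-1", "amount": 120},
        {"id": "evt-1", "amount": 120},  # upstream redelivered the same event
        {"id": "evt-2"},                 # malformed: fails every attempt
    ]
]
print(results, invoices, len(processor.dead_letters))
```

The redelivered event produces no second invoice, and the malformed one ends up somewhere a person can inspect it, which covers the two silent failures described above: double-processing and dropped events.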

The pattern that survives

The automations I've built that ran for five years had a few unglamorous things in common:

Prompts versioned in the repo, tested before deploy.
Output monitoring, not just API monitoring.
Human review step for any irreversible decision, gated by confidence.
Real queue infrastructure, idempotency, dead-letter handling.
Targeted use of AI - for the steps where it actually earned its place - and plain old deterministic code for everything else.

None of that is exciting. None of it appears in a launch announcement. All of it is the difference between a workflow that survives a quarter in production and one that quietly degrades.

If you're shipping AI workflows and you can't tell me which of those five your team has covered, the workflows aren't done. They're at the start of the part nobody wants to do.

Get in touch

Got a workflow that needs to survive longer than a quarter?

Send me a short note about what you're trying to build and where it keeps breaking. I'll reply within a working day.

Most things start with a short email - info@pawon.dev.