Human-in-the-loop workflows: the part everyone skips

The pitch for AI-assisted workflows is usually some version of "remove the human from the loop." The version that actually holds up in production almost always puts a human back into it - at exactly one step, with a specific job, with a specific UI.

When the human step is missing, you get the failure modes from the previous article. When it's there but designed badly, you get a different problem: a workflow technically running, technically using AI, technically saving time - but bottlenecked on someone reviewing forty-seven cases a day in a UI that was never designed for review work.

So: where does the human step belong, and how do you build it so it stays a checkpoint instead of a bottleneck?

Where the human step belongs

Three heuristics that have held up across every operational workflow I've shipped:

1. Insert the human where decisions are irreversible

Anything that touches money, customer-facing communication, contracts, or identity, or that is otherwise hard to undo. A refund. An invoice. An email that goes to a customer. A change to a CRM record that affects routing.

The cost of a human looking at it for ten seconds is much lower than the cost of an irreversible AI mistake at scale. Even at high volume, a confidence-gated human review handles the long tail without sitting in the middle of every transaction.
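One way to express a confidence gate that is stricter for irreversible actions is sketched below. The action names, the two thresholds, and the `ProposedAction` type are all assumptions made for illustration; the right values depend on your own volumes and error costs.

```python
from dataclasses import dataclass

# Hypothetical action kinds, loosely based on the examples above.
IRREVERSIBLE_ACTIONS = {"refund", "invoice", "customer_email", "crm_routing_change"}

# Assumed thresholds; tune them against your own review data.
IRREVERSIBLE_THRESHOLD = 0.95
DEFAULT_THRESHOLD = 0.80


@dataclass
class ProposedAction:
    kind: str          # e.g. "refund" or "internal_tag"
    confidence: float  # model confidence in [0, 1]


def needs_human_review(action: ProposedAction) -> bool:
    """Return True when the action should wait in the review queue."""
    threshold = (
        IRREVERSIBLE_THRESHOLD
        if action.kind in IRREVERSIBLE_ACTIONS
        else DEFAULT_THRESHOLD
    )
    return action.confidence < threshold
```

With these assumed numbers, a refund at 0.90 confidence waits for a human while an internal tag at 0.90 ships. The human sees the long tail, not every transaction.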

2. Insert the human where confidence is low

The model returns a confidence score. (If it doesn't natively, you can derive one: how stable the output is across repeated samples at nonzero temperature, the log-probability of the chosen class, or whether two independent passes agree.) Below a threshold, route to a human. At or above it, ship.

This is the difference between a workflow where the human reviews everything (slow) and one where the human reviews the 5–15% the model isn't sure about (manageable). It also gives the human something genuinely useful to do - they're not rubber-stamping easy cases, they're working on the edge cases where their judgment actually matters.
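A minimal sketch of two of the derived scores mentioned above, plus the routing rule. The function names and the 0.8 default threshold are assumptions; how you obtain the repeated labels or the log-probability depends on your model API, which is deliberately left out here.

```python
import math
from collections import Counter
from typing import Sequence, Tuple


def confidence_from_logprob(logprob: float) -> float:
    """Convert the log-probability of the chosen class into a probability."""
    return math.exp(logprob)


def confidence_from_agreement(labels: Sequence[str]) -> Tuple[str, float]:
    """Return the majority label and the share of passes that agreed with it."""
    if not labels:
        raise ValueError("need at least one pass")
    label, count = Counter(labels).most_common(1)[0]
    return label, count / len(labels)


def route(confidence: float, threshold: float = 0.8) -> str:
    """Ship at or above the threshold, otherwise send to a human."""
    return "ship" if confidence >= threshold else "human_review"
```

For example, four passes returning "hot", "hot", "cold", "hot" give an agreement score of 0.75, which falls below the assumed threshold and goes to review. Agreement across passes is a rough proxy, not a calibrated probability, so it is worth checking against how often reviewers actually override.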

3. Insert the human where the data is novel

When the workflow encounters input that doesn't look like anything it's seen before, pause and ask. (This is what out-of-distribution detection is for.) New customer segment, new portal source, new product line. Anything where the training data doesn't have an analog.

This one's harder to detect automatically but worth instrumenting. Even a simple embedding-distance check against your training set catches the obvious novelty cases.
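The simple embedding-distance check could look roughly like this. It assumes you already have embeddings for a reference sample of past inputs; the `max_distance` value of 0.35 is a placeholder that would need tuning on real data, and a linear scan like this only suits small reference sets.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def cosine_distance(a: Vector, b: Vector) -> float:
    """1 - cosine similarity; treats a zero vector as maximally distant."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def is_novel(
    embedding: Vector,
    reference_embeddings: Sequence[Vector],
    max_distance: float = 0.35,  # assumed cutoff, not a recommendation
) -> bool:
    """Flag the input when even its nearest reference is too far away."""
    if not reference_embeddings:
        return True  # nothing to compare against, so ask a human
    nearest = min(cosine_distance(embedding, ref) for ref in reference_embeddings)
    return nearest > max_distance
```

Anything flagged as novel goes to the review queue regardless of the model's confidence, since confidence on unfamiliar input is the least trustworthy kind.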

How to design the review UI

This is where most workflows fall over. The model is doing good work. The human review step is technically present. But the UI is just a database admin panel with an "approve" button and a "reject" button, and reviewing a single case takes three minutes.

A useful review UI does four things:

It shows the model's reasoning, not just the output. If the model classified a lead as "hot," show the specific signal it picked up on - the words in the inquiry, the source attribution, the historical pattern match. The human is faster at agreeing or disagreeing with reasoning than at re-deriving the classification from scratch.

It lets the human override partially, not just accept or reject. Maybe the classification is right but the priority tier is wrong. Maybe the extracted phone number is right but the country code isn't. Granular overrides let the human contribute their judgment without redoing the entire job.
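As a sketch of what a granular override means at the data level, assume the model's output is a small record with separately editable fields. The `LeadDecision` type and its field names are hypothetical, chosen to match the examples in the paragraph above.

```python
from dataclasses import dataclass, fields, replace
from typing import Any, Dict


@dataclass(frozen=True)
class LeadDecision:
    classification: str
    priority_tier: str
    phone_number: str
    country_code: str


def apply_override(decision: LeadDecision, changes: Dict[str, Any]) -> LeadDecision:
    """Return a copy with only the reviewer's corrected fields changed."""
    known = {f.name for f in fields(decision)}
    unknown = set(changes) - known
    if unknown:
        raise KeyError(f"unknown fields: {sorted(unknown)}")
    return replace(decision, **changes)
```

A reviewer who fixes only the country code calls `apply_override(decision, {"country_code": "+44"})`, and everything the model got right is kept as is. The original record stays untouched, which makes it easy to compare the model's version with the human's afterwards.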

It learns from overrides. Every override should be logged with enough structure to be a training signal - even if you're not retraining the model, you're calibrating the confidence threshold, refining the prompt, or feeding into an eval set.
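Here is one possible shape for a structured override log, along with a small helper that turns review outcomes into a calibration signal. The record fields and the idea of bucketing by confidence decile are assumptions for illustration, not a fixed schema.

```python
import json
import math
from collections import defaultdict
from typing import Dict, Iterable


def override_record(
    case_id: str, field: str, model_value: str, human_value: str, confidence: float
) -> str:
    """Serialize one override as a JSON line suitable for an append-only log."""
    return json.dumps(
        {
            "case_id": case_id,
            "field": field,
            "model_value": model_value,
            "human_value": human_value,
            "confidence": confidence,
        },
        sort_keys=True,
    )


def override_rate_by_decile(reviews: Iterable[dict]) -> Dict[int, float]:
    """Share of reviewed cases that were overridden, per confidence decile.

    Each review is a dict with "confidence" (float) and "overridden" (bool).
    """
    totals: Dict[int, int] = defaultdict(int)
    overridden: Dict[int, int] = defaultdict(int)
    for review in reviews:
        decile = min(math.floor(review["confidence"] * 10), 9)
        totals[decile] += 1
        if review["overridden"]:
            overridden[decile] += 1
    return {d: overridden[d] / totals[d] for d in sorted(totals)}
```

If reviewers rarely override anything in the decile just below the threshold, that is a hint the threshold could come down. If they override a lot just above it, based on spot checks, it should probably go up.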

It batches similar cases. Reviewing twelve "is this lead hot?" cases in a row is faster than alternating between unrelated workflows. Group by classification type, by source, by ambiguity pattern. The human's mental context-switch cost is real.
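Batching can be as simple as grouping the queue by a key before showing it to the reviewer. The sketch below assumes each queued case is a dict with `classification_type` and `source` keys; both names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

BatchKey = Tuple[str, str]


def batch_cases(cases: Iterable[dict]) -> Dict[BatchKey, List[dict]]:
    """Group queued cases by (classification type, source), preserving order."""
    batches: Dict[BatchKey, List[dict]] = defaultdict(list)
    for case in cases:
        key = (case["classification_type"], case["source"])
        batches[key].append(case)
    return dict(batches)
```

The review UI then walks one batch at a time, so the reviewer answers the same kind of question repeatedly before moving on to the next kind.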

The bottleneck question

The objection is always: "But humans don't scale." Correct. They don't have to.

The math: if the model handles 90% with high confidence and the human handles 10% in a properly designed UI, you've turned what used to be 100 manual cases per day into 10. A 90% reduction is plenty. Trying to push past that - toward fully unattended automation - is usually where the workflow falls apart and you eat the cost of every silent failure.
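The arithmetic above, written out so you can plug in your own numbers. The function names and the per-case review time are assumptions; measure the actual review time in your own UI.

```python
def daily_review_load(total_cases: int, auto_share: float) -> int:
    """Cases per day that still need a human, given the share shipped automatically."""
    if not 0.0 <= auto_share <= 1.0:
        raise ValueError("auto_share must be between 0 and 1")
    return round(total_cases * (1.0 - auto_share))


def daily_review_minutes(
    total_cases: int, auto_share: float, seconds_per_case: float
) -> float:
    """Total human review time per day, in minutes."""
    return daily_review_load(total_cases, auto_share) * seconds_per_case / 60.0
```

With 100 cases a day and 90% handled automatically, `daily_review_load(100, 0.9)` gives the 10 cases from the example. The second function shows why the UI matters: the same 10 cases cost very different amounts of time at ten seconds each versus three minutes each.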

The workflow that survives in production isn't the fully automated one. It's the one where the model and the human are each doing the job they're best at: the model handling pattern-matching at volume, the human handling edge cases at depth.

When the human step needs to be you, not a junior

For some decisions, the right reviewer isn't a generic ops person. It's the founder, or the senior operator, or me. Pricing exceptions. Strategic communications. Anything where the cost of a bad answer is significantly higher than the time cost of pulling someone senior in.

The mistake is trying to make these decisions cheap. They shouldn't be. They should be rare, batched, and routed to the right person. Workflow done well makes them rare. Workflow done badly makes them frequent.

What this looks like in practice

Most workflow projects I take on involve at least one round of inserting or repairing the human step. Common patterns:

  • Adding a "human review queue" UI on top of an existing classifier, with batching and granular overrides.
  • Adding confidence scoring to an existing model output and a routing rule on top of it.
  • Replacing a "fully automated" workflow that's been silently mis-classifying for months with a 90% / 10% split.
  • Building the data pipeline that turns human overrides into useful telemetry.

None of this is exciting. None of it appears in an AI announcement. All of it is the difference between a workflow that runs for a year and one that has to be replaced every quarter.

The phrase "human-in-the-loop" is doing a lot of unspecified work in most pitches. The interesting question is always which loop, which human, which step, and what does their UI look like. That's where the operational design happens.

