How I Decide Whether AI Output Deserves My Trust

From Prompting to Supervision

The first time this really clicked for me was while reviewing German translation output from the custom Gemini-powered plugin I built for Christian Pure. On the surface, everything looked fine. The system had produced text, the rows were filled, and the admin table showed the work as complete. But when I spot-checked a few posts, some of the “German” output was still English. Not clumsy translation. Not debatable phrasing. Literal English strings sitting in fields the system had marked as done.

That was a useful failure because it exposed one of the most dangerous properties of AI-assisted work: output can look finished before it has done the job. A filled field is not a translation. A rendered screen is not a product. A passing first impression is not the same as correctness. AI is very good at producing artifacts that resemble progress — translated-looking rows, plausible code, coherent screens, polished summaries — and if you are tired, rushed, or flattered by the speed, it is easy to accept resemblance as reality.

That failure changed how I work. I stopped treating prompting as the main skill. The real shift was from prompting to supervision: defining acceptance criteria before generation, grounding the model in real system constraints, and building checks that make fluent failure visible before it reaches users.

There is a word showing up more often now for this layer: the harness. The model generates, reasons, and calls tools, but the harness decides what context it sees, what tools are available, what constraints are enforced, what gets logged, when retries stop, and how failures become visible. Viv Trivedy put the formula cleanly: if you’re not the model, you’re the harness. A harness is the practical scaffolding around the model: CLAUDE.md, validators, hooks, cost caps, diff review, CI, staging, and real-device checks.

I did not call my Christian Pure workflow a harness at the time, but that is what it became: Gemini for generation, a theological glossary for constraints, post-translation checks for English carryover, API logging and spend caps for failure detection, and an admin review path before publishing.

Most of my AI work sits in the messy middle between design, product, and implementation: a children’s app with purchase flows and multi-language audio, a WordPress translation system covering thousands of posts across multiple languages, and a room visualizer where spatial mistakes are product mistakes. In that kind of work, “looks good” is not enough. I need to know what source of truth the model used, what constraints it had to preserve, what changed in the diff, what checks passed, and what failure would look like if the model were wrong.

The real job, not the visible output

For Christian Pure, the job was never “generate text in another language.” The job was to make a large multilingual site feel coherent without turning administration into a permanent cleanup operation. If a German page still contains English carryover, the system has not translated the page. It has produced non-empty output.

That distinction matters because AI systems are very good at satisfying the easiest visible requirement. The field is filled. The screen renders. The answer has structure. But the real job usually lives one layer deeper. Did the translation preserve meaning? Did the code fix the bug that actually mattered? Did the design support the user’s real environment, not just the demo state?

After that translation failure, I started treating AI output as untrusted intermediate work. For the plugin, that meant explicit post-translation checks for untranslated English strings, tighter review of completed rows, and closer monitoring of API behavior. The work stopped being “ask the model to translate” and became “design a workflow where bad translations have somewhere to surface.”
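A carryover check like that can be surprisingly simple. Here is a minimal sketch of the idea, assuming a small stoplist heuristic; the marker list, function name, and threshold are illustrative, not the plugin's actual implementation (a production check would use a larger list or a language-detection library):

```python
import re

# Hypothetical stoplist of common English function words. A real check would
# use a larger list or a language-detection library; this is only a sketch.
ENGLISH_MARKERS = {"the", "and", "with", "from", "that", "this", "have", "your"}

def looks_untranslated(text: str, threshold: float = 0.15) -> bool:
    """Flag output whose share of common English words exceeds a threshold."""
    words = re.findall(r"[a-zA-Z']+", text.lower())
    if not words:
        return False
    hits = sum(1 for w in words if w in ENGLISH_MARKERS)
    return hits / len(words) >= threshold
```

The point is not linguistic sophistication. It is that "non-empty output" and "translated output" become distinguishable by a machine, so bad rows surface in review instead of shipping.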

The same pattern shows up in code. A model can generate Flutter code that compiles but ignores the RevenueCat edge case that caused the purchase bug. It can produce a clean implementation plan that misses the actual constraint. It can give you a screen that looks right in the normal development path but fails on an older Android device or under lock-screen audio behavior. Before I trust the output, I restate the job in plain language and ask whether the artifact actually did that job, not a cleaner or easier version of it.

Constraints have to be explicit

A lot of AI-generated work fails because it satisfies the visible request while violating the real constraints underneath it: the approval flow, the design tokens, the cost ceiling, the content rules, the device constraints, the architecture, the maintenance burden.

The translation plugin taught me this at real cost. The first version had too little instrumentation around errors, retries, and rate limits. It hit Gemini’s API limits, failed quietly, retried too aggressively, and burned through roughly $250 before I caught it. The output still looked normal enough at a glance. There was no dramatic crash. The system just kept trying, failing, and billing.
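The fix for that failure mode is mechanical: a hard spend ceiling checked before every call, and retries that are bounded and logged rather than silent. A minimal sketch, assuming a generic `call_api` callable and a per-request cost estimate standing in for the real Gemini client (all names here are illustrative):

```python
import time

class SpendCapExceeded(Exception):
    pass

def call_with_guardrails(call_api, est_cost, state, cap_usd=50.0,
                         max_retries=3, backoff_s=2.0):
    """Wrap an API call with a hard spend cap and bounded, backed-off retries."""
    for attempt in range(max_retries):
        if state["spent_usd"] + est_cost > cap_usd:
            # Stop loudly instead of quietly billing through the ceiling.
            raise SpendCapExceeded(f"cap {cap_usd} hit at {state['spent_usd']:.2f}")
        state["spent_usd"] += est_cost
        try:
            return call_api()
        except Exception as err:
            state.setdefault("errors", []).append(str(err))  # make failure visible
            time.sleep(backoff_s * (2 ** attempt))           # exponential backoff
    raise RuntimeError(f"gave up after {max_retries} attempts")
```

Nothing here improves translation quality. It just guarantees that "trying, failing, and billing" becomes an exception in a log instead of a surprise on an invoice.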

The lesson was not that I needed a more magical model. The lesson was that I had handed the model a cleaner problem than the one I actually had. The real problem was not only translation quality. It was translation quality under cost constraints, rate limits, theological terminology rules, admin review needs, and failure modes that had to be visible.

So I changed the way I frame non-trivial AI tasks. Before generation starts, I try to spell out four things: what must stay fixed, what the source of truth is, what failure patterns to watch for, and how I will know before production if something is off. For Christian Pure, that meant a theological glossary with hard-locked terms, language-specific avoid lists, and checks for untranslated English after the model returned output. The model was still useful, but it was no longer operating in a vacuum.
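The glossary constraint is also checkable after the fact. A sketch of the idea, with an invented two-entry glossary and avoid list standing in for the real theological glossary (the terms below are purely illustrative):

```python
# Hypothetical hard-locked glossary and avoid list; the real project's
# glossary is larger and curated, these entries are only for illustration.
GLOSSARY = {"de": {"grace": "Gnade", "covenant": "Bund"}}
AVOID = {"de": {"Gunst"}}  # near-synonyms that drift from the intended meaning

def glossary_violations(source: str, translated: str, lang: str) -> list[str]:
    """Return human-readable violations for admin review; empty means clean."""
    problems = []
    for term, locked in GLOSSARY.get(lang, {}).items():
        if term in source.lower() and locked not in translated:
            problems.append(f"locked term '{term}' missing translation '{locked}'")
    for bad in AVOID.get(lang, set()):
        if bad in translated:
            problems.append(f"avoid-list word '{bad}' present")
    return problems
```

A check like this cannot judge whether a translation is good. It can only prove a constraint was violated, which is exactly what the admin review queue needs.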

I saw the same issue in Halo. I asked Claude Code for help with a specific Android audio focus bug around lock-screen behavior. The plan that came back was ambitious and internally consistent, but it wanted a broad architectural overhaul when the right answer was narrower. That is a particular kind of AI drift: the model solves a cleaner version of the problem by widening the scope. These days, when a narrow fix turns into a grand redesign, I treat that as a warning sign.

The model has to reference the system, not just resemble it

One of the easiest AI failures to miss is work that looks like it belongs but is not actually connected to the system. The names feel right. The component shape is familiar. The copy sounds close. But when you inspect it, the output is not using the real tokens, the real files, the real APIs, or the real decisions already made.

That is why I care about boring setup files and project conventions. In Halo, a CLAUDE.md file encodes the rules the model needs to respect: project structure, coding conventions, architecture notes, and things that should not be casually rewritten. That kind of context is not glamorous, but it reduces the odds that the model produces a plausible outsider — code that resembles the app without really belonging to it.

I also try not to trust prose summaries of code changes. I want the diff. I want to see which files changed, which imports appeared, whether the model touched more than it needed to, and whether the solution fits the existing pattern. For Flutter work, I use Claude Code hooks to run checks like dart analyze, dart format, and flutter test around edits, with GitHub Actions as another gate before changes land. Those checks do not replace judgment, but they keep easy mistakes from consuming all of it.
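For context, a hook setup like that lives in Claude Code's settings file. This is a sketch of the shape such a configuration can take, not my exact project config; check the current Claude Code hooks documentation before copying it:

```json
{
  "hooks": {
    "PostToolUse": [
      {
        "matcher": "Edit|Write",
        "hooks": [
          { "type": "command", "command": "dart format . && dart analyze" }
        ]
      }
    ]
  }
}
```

The effect is that every model edit is immediately followed by the same formatter and analyzer a human contributor would face, so sloppy output fails fast instead of accumulating in the diff.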

That is the practical difference between “using AI” and supervising AI. I am not just asking for output. I am shaping the environment around the output: the source files it can reference, the constraints it must preserve, the automated checks it has to pass, and the review path it goes through before I trust it.

Provenance matters more when the writing is good

AI is useful partly because it can fill gaps. That same ability makes it dangerous when the work depends on truth. Plausible material and grounded material often arrive in the same tone.

I hit this directly on a Halo case study draft. The first pass had good structure and real details pulled from our conversation history. But a few of the best-sounding sections — a naming story, an art-style decision, a migration moment — had no evidence behind them. The model had smoothed over gaps with plausible narrative. It read well. Parts of it were not true.

That failure changed how I review AI-assisted writing and research. I ask where a claim came from before I ask whether it sounds good. Did it come from my notes, a source, a transcript, a repo, or the model’s attempt to make the story smoother? If a model gives me market numbers, keyword data, or a confident strategic claim, I want a traceable source. If it cannot provide one, I treat the claim as an inference, not a fact.

This applies to code too. When a model references an API, package behavior, or project method, I want to know whether it exists. “Feels like it should exist” is not enough. Plausibility is not provenance, and polished output is often the moment when provenance matters most.

Good demos are not enough

Clean prompts make AI look better than it is. So do clean inputs, clean screenshots, clean test cases, and clean demos. Real use is messier. Users are vague. Content is missing. Devices are small. APIs fail. Permissions get weird. Old state lingers in places nobody remembered to check.

The room-styler project made this obvious. Some generated layouts were not broken in a dramatic way. They had furniture, style, and enough polish to pass a quick glance. But they were still bad product outcomes. Scale was off, guest rooms felt unrealistic, and some layouts looked acceptable until you noticed they blocked circulation. Nothing crashed. The output was just wrong in a way that required product judgment to catch.

That changed how I think about predictable AI mistakes. If a system fails in a pattern, I do not treat each instance as a random annoyance. I ask what guardrail is missing. Should the prompt be more explicit? Should there be a second pass? Should the system check for doorway blockage, unsupported claims, untranslated strings, or cost spikes? The point is not to make AI perfect. The point is to stop being surprised by the same failure twice.

This is where product judgment matters. A room layout that blocks circulation is not merely an aesthetic issue. It breaks the user’s trust in the tool. A purchase bug is not merely a technical issue. It touches money and trust. A mistranslated religious term is not merely a copy issue. It changes meaning. The checks you build should follow the risks users actually feel.

Failure has to be visible

Silent failure is the part that worries me most. Conventional software can fail silently too, but its failure modes are often easier to localize because the system is more deterministic. AI systems are harder because the output can remain fluent and well-formed even when the operation underneath has failed. The format stays clean, the tone stays calm, and the surface remains convincing.

Before I trust an AI workflow, I want to know how I will notice when it breaks. For the translation plugin, that means logging API calls and costs, inspecting retry behavior, capping spend, and flagging output that still appears to contain English. The point is not only to recover from failure. The first job is to detect it.

For Halo, the equivalent is a layered review path. Static checks catch some problems. Tests catch others. A focused reviewer pass is useful for changes that touch purchase logic, audio behavior, or other high-risk areas. Real-device testing matters because emulator-clean behavior is not the same as user-clean behavior, especially on older Android hardware.

None of this is exciting in the way a fast prototype is exciting. It is logs, tests, diffs, review queues, CI gates, staging, cost caps, retry inspection, and device checks. But that is the scaffolding that keeps AI output from quietly becoming a liability.

Review should scale with risk

I do not apply the same level of scrutiny to everything. That would be slow and miserable. Throwaway copy gets a light pass. A small UI tweak gets checked for fit and obvious nonsense. But anything involving money, user trust, production data, published claims, or a core product flow gets reviewed much more aggressively.

For low-stakes output, I mostly ask whether it looks sane and fits the system. For medium-stakes work, I add questions about whether it solved the real job, respected the constraints, and handled ugly cases. For high-stakes work — RevenueCat purchase code, multilingual publishing, anything that touches real customers or real money — I want grounding, failure detection, cost awareness, and a maintenance cost I would still accept next quarter.

That last part is where a lot of AI output quietly fails. The file works, but it is bloated. The UI is close, but the structure is wrong. The logic handles the demo but will be painful to extend. First-order solutions with second-order messes are borrowed time, and borrowed time compounds.

What changed

I still like speed. I still like polish. I still like getting a strong first draft in minutes instead of hours. But I trust those signals less than I used to because I have seen how often they arrive before reliability.

The useful question for me is no longer “Can AI do this?” It can almost always produce something. The better question is: what would convince me this output deserves to ship?

That question changed the work. I now think less in terms of isolated prompts and more in terms of systems: acceptance criteria, source grounding, project context, automated checks, review paths, cost controls, and real-world testing. The model still matters, but the workflow around the model matters just as much.

The first draft is cheap now. Judgment is not. The real leverage is building enough scaffolding around that judgment that it becomes repeatable instead of something I have to summon from scratch every time.
