How I Decide Whether AI Output Deserves My Trust

From Prompting to Supervision

The first time this clicked for me, I was reviewing German translations from a custom Gemini-powered plugin I had built for Christian Pure. The admin table looked normal. Rows were filled. Jobs were marked complete. From a distance, the system looked like it had done the work.

Then I opened a few posts and found English sitting inside fields that were supposed to contain German. The failure was not subtle. It was not a matter of taste, fluency, or phrasing. The workflow had marked unfinished work as finished, and the interface had made that failure look calm.

That mistake changed how I think about AI-assisted work. A filled field is not a translation. A rendered screen is not a product. A coherent answer is not the same thing as a correct answer. AI is very good at producing artifacts that resemble progress: translated-looking rows, plausible code, tidy screens, polished summaries. When I am tired or moving quickly, resemblance can be enough to fool me for a moment.

That is when I stopped thinking of prompting as the center of the skill. Prompting still matters, but the more important work is supervision: deciding what success means before generation starts, grounding the model in the real system, and creating checks that make fluent failure visible before it reaches users.

People sometimes call this surrounding layer a harness. I did not have that word in mind when I built the Christian Pure workflow, but that is what it became. Gemini generated the translations. A theological glossary constrained key terms. Post-translation checks looked for English carryover. API logs and spend caps made failure easier to detect. Admin review kept output away from production until someone could inspect it.

Most of my AI work lives in that messy middle between product, design, and implementation. I have worked on a children’s bedtime app with purchase flows and multilingual audio, a WordPress translation system covering thousands of posts, and a room visualizer where spatial mistakes break the promise of the product. In that kind of work, looking finished is a weak signal. I need to know which source of truth the model used, which constraints it preserved, what changed in the diff, what checks passed, and what failure would look like if the model were wrong.

The job is deeper than the visible output

For Christian Pure, the job was larger than generating text in another language. The actual job was to help a large multilingual site feel coherent without turning administration into permanent cleanup work. If a German page still contains English carryover, the system has produced non-empty output, not a translated page.

That distinction shows up everywhere. A model can fill a field, render a screen, or return a structured answer while missing the point of the work. The deeper question is whether the artifact did the job that matters. Did the translation preserve meaning? Did the code fix the bug that caused the real failure? Did the design support the user’s real environment rather than the clean demo path?

After the translation failure, I started treating AI output as untrusted intermediate work. For the plugin, that meant explicit checks for untranslated English strings, tighter review of completed rows, and closer monitoring of API behavior. The workflow stopped being a request for translation and became a system where bad translations had somewhere to surface.
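As a rough illustration of what a check for English carryover could look like, here is a minimal Python sketch. It is not the plugin's actual code: the word list, the function names, and the 0.15 threshold are all assumptions chosen for the example, and a real check would need tuning per language.

```python
import re

# Hypothetical list of very common English function words. Words that are
# also common German words (for example "was" or "die") are left out on purpose.
ENGLISH_MARKERS = {
    "the", "and", "of", "to", "is", "that", "with", "for", "this", "are",
    "not", "have", "from", "which", "you",
}

# Matches runs of letters, including German umlauts and eszett.
WORD_RE = re.compile(r"[A-Za-zÄÖÜäöüß]+")


def english_marker_ratio(text: str) -> float:
    """Return the share of words that are common English function words."""
    words = [w.lower() for w in WORD_RE.findall(text)]
    if not words:
        return 0.0
    hits = sum(1 for w in words if w in ENGLISH_MARKERS)
    return hits / len(words)


def looks_untranslated(text: str, threshold: float = 0.15) -> bool:
    """Flag a field whose English-marker ratio meets or exceeds the threshold."""
    return english_marker_ratio(text) >= threshold
```

A heuristic like this only catches obvious carryover. An empty field scores 0.0 and is not flagged, so emptiness needs its own check, and anything flagged should go to review rather than be rejected automatically.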

The same pattern shows up in code. A model can generate Flutter code that compiles while ignoring the RevenueCat edge case that caused the purchase bug. It can produce a neat implementation plan that misses the actual constraint. It can build a screen that looks fine in the normal development path and then fails on an older Android device or during lock-screen audio behavior. Before I trust the output, I restate the job in plain language and check whether the artifact solved that job, including the ugly parts.

Constraints have to be explicit

A lot of AI-generated work fails because the visible request is easier than the real constraint underneath it. The model can satisfy the instruction while violating the approval flow, design tokens, cost ceiling, content rules, device constraints, architecture, or maintenance burden.

The translation plugin taught me this at real cost. The first version had too little instrumentation around errors, retries, and rate limits. It hit Gemini’s API limits, failed quietly, retried too aggressively, and burned through roughly $250 before I caught it. There was no dramatic crash. The output still looked normal enough at a glance. The system kept trying, failing, and billing.
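Aggressive retries against a rate limit are commonly tamed with bounded exponential backoff. The Python sketch below shows the general shape under stated assumptions: `RateLimited` is a hypothetical exception standing in for whatever the API client raises, and the attempt count and delays are illustrative values, not the ones the plugin used.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RateLimited(Exception):
    """Hypothetical error standing in for an API rate-limit response."""


def call_with_backoff(
    request: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call request(), doubling the wait after each rate-limit error.

    After max_attempts failures the last error is re-raised, so the
    caller sees the failure instead of an endless quiet retry loop.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    attempt = 1
    while True:
        try:
            return request()
        except RateLimited:
            if attempt >= max_attempts:
                raise
            # Waits are base_delay, 2x, 4x, ... between attempts.
            sleep(base_delay * 2 ** (attempt - 1))
            attempt += 1
```

The `sleep` parameter is injectable so the retry schedule can be tested without waiting. The important property is the hard stop: a job that keeps hitting the limit ends in a visible error rather than continuing to bill.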

The lesson was practical. I had handed the model a cleaner problem than the one I actually had. The real problem included translation quality under cost limits, rate limits, theological terminology rules, admin review needs, and failure modes that needed to be visible. Once I understood that, I changed how I frame non-trivial AI tasks.

Before generation starts, I try to write down four things: what must stay fixed, what source of truth the model should use, which failure patterns deserve attention, and how I will know before production if something is off. For Christian Pure, that meant hard-locked glossary terms, language-specific avoid lists, and checks for untranslated English after the model returned output. The model was still useful, and it was finally working inside the real shape of the problem.
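A hard-locked glossary can be enforced after generation with a simple comparison between source and output. The following Python sketch assumes a glossary that maps source terms to one required rendering. The two entries are illustrative examples only, not the site's real glossary, and the function name is hypothetical.

```python
from typing import Dict, List

# Illustrative entries only: source term -> required German rendering.
LOCKED_TERMS: Dict[str, str] = {
    "grace": "Gnade",
    "salvation": "Erlösung",
}


def missing_locked_terms(
    source: str, translation: str, glossary: Dict[str, str]
) -> List[str]:
    """Return source terms whose required rendering is absent from the translation.

    Matching is case-insensitive substring matching, so inflected forms
    such as "Gnaden" still count as containing "Gnade".
    """
    source_lower = source.lower()
    translation_lower = translation.lower()
    missing = []
    for term, required in glossary.items():
        if term.lower() in source_lower and required.lower() not in translation_lower:
            missing.append(term)
    return missing
```

Substring matching is deliberately loose and has known gaps: "grace" also matches inside "disgrace". A stricter version would match on word boundaries, at the cost of handling inflection explicitly.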

I saw the same issue in Halo. I asked Claude Code for help with a specific Android audio focus bug around lock-screen behavior. The plan it returned was ambitious and internally consistent, yet it widened a narrow fix into a broad architecture project. That is a specific kind of AI drift: the model solves a cleaner version of the problem by expanding the scope. When a small repair turns into a grand redesign, I now treat that as a warning sign.

The output has to belong to the system

One of the easiest failures to miss is work that looks like it belongs while staying disconnected from the actual system. The names feel right. The component shape is familiar. The copy sounds close. Then you inspect the result and find that it ignored the real tokens, files, APIs, components, or decisions already in the product.

Boring setup files and project conventions matter for that reason. In Halo, a CLAUDE.md file gives the model rules it needs to respect: project structure, coding conventions, architecture notes, and areas where casual rewrites are dangerous. This kind of context will never make an impressive demo, but it reduces the odds that the model produces plausible outsider code that merely resembles the app.

I also avoid trusting prose summaries of code changes. I want the diff. I want to see which files changed, which imports appeared, whether the model touched more than the task required, and whether the solution fits the existing pattern. For Flutter work, I use Claude Code hooks to run checks like dart analyze, dart format, and flutter test around edits, with GitHub Actions as another gate before changes land. Those checks do not replace judgment. They keep basic mistakes from consuming all of it.
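One way to picture the gate is a small script that runs the three commands named above and reports which ones failed. This Python sketch is an assumption about shape, not the actual hook or CI configuration: the `dart format` flags are chosen so the command reports unformatted files without rewriting them, and how the script would be wired into Claude Code hooks or GitHub Actions is not shown.

```python
import subprocess
from typing import Callable, List, Sequence

# Commands named in the text. The dart format flags are an assumption:
# they make the command exit non-zero on unformatted files without
# modifying anything on disk.
CHECKS: List[List[str]] = [
    ["dart", "analyze"],
    ["dart", "format", "--output=none", "--set-exit-if-changed", "."],
    ["flutter", "test"],
]


def run_command(cmd: Sequence[str]) -> int:
    """Run one command in the current directory and return its exit code."""
    return subprocess.run(list(cmd)).returncode


def failed_checks(
    checks: Sequence[Sequence[str]],
    runner: Callable[[Sequence[str]], int] = run_command,
) -> List[str]:
    """Run every check and return the command lines that exited non-zero."""
    return [" ".join(cmd) for cmd in checks if runner(cmd) != 0]
```

Calling `failed_checks(CHECKS)` from the project root runs all three checks even if an early one fails, which gives a complete picture in one pass. The `runner` parameter exists so the logic can be tested without a Flutter toolchain installed.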

That is the practical difference between using AI casually and supervising AI seriously. I am shaping the environment around the output: the source files it can reference, the constraints it must preserve, the checks it has to pass, and the review path it goes through before I trust it.

Provenance matters most when the writing sounds good

AI is useful because it can fill gaps. The same ability makes it dangerous when the work depends on truth. Plausible material and grounded material often arrive in the same confident tone, which means the writing can feel most convincing exactly when it needs the most inspection.

I hit this while drafting a Halo case study. The first pass had good structure and included real details from our conversation history. A few of the best-sounding sections, though, had no evidence behind them: a naming story, an art-style decision, a migration moment. The model had smoothed over missing information with plausible narrative. It read well, and parts of it were not true.

That failure changed how I review AI-assisted writing and research. I ask where a claim came from before I decide whether it sounds good. Did it come from my notes, a transcript, a repo, a source, or the model’s attempt to make the story smoother? If a model gives me market numbers, keyword data, or a confident strategic claim, I want a traceable source. If it cannot provide one, I treat the claim as an inference.

This applies to code as well. When a model references an API, a package behavior, or a project method, I want to know whether it exists. “Feels like it should exist” is a bad reason to trust anything. Plausibility is not provenance, and polished output is often where provenance matters most.

Good demos hide weak spots

Clean prompts make AI look better than it is. Clean screenshots, clean inputs, clean tests, and clean demos do the same thing. Real use is messier. Users are vague. Content is missing. Devices are small. APIs fail. Permissions get strange. Old state lingers in places nobody remembered to check.

The room-styler project made this obvious. Some generated layouts were not broken in a dramatic way. They had furniture, style, and enough polish to pass a quick glance. The problem was product judgment: scale was off, guest rooms felt unrealistic, and some layouts looked acceptable until you noticed they blocked circulation. Nothing crashed. The output was simply wrong in a way that a generic visual check would miss.

That changed how I think about predictable AI mistakes. If a system fails in a pattern, I try not to treat each instance as a fresh annoyance. I ask what guardrail is missing. Does the prompt need to be more explicit? Should there be a second pass? Should the system check for doorway blockage, unsupported claims, untranslated strings, cost spikes, or broadened scope? I do not expect AI to become perfect. I do expect my workflow to stop being surprised by the same failure twice.
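A doorway-blockage guardrail can be as simple as a geometric overlap test. The Python sketch below is purely illustrative: it assumes furniture and a door's clearance zone are available as axis-aligned rectangles in metres, which may not match how a real room visualizer represents a layout, and all names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Rect:
    """Axis-aligned rectangle on the floor plan, in metres."""
    x: float
    y: float
    width: float
    depth: float

    def overlaps(self, other: "Rect") -> bool:
        """True when the rectangles share interior area; touching edges do not count."""
        return (
            self.x < other.x + other.width
            and other.x < self.x + self.width
            and self.y < other.y + other.depth
            and other.y < self.y + self.depth
        )


def blocking_items(door_clearance: Rect, furniture: Dict[str, Rect]) -> List[str]:
    """Return the names of furniture pieces that intrude into the door's clearance zone."""
    return [name for name, rect in furniture.items() if rect.overlaps(door_clearance)]
```

A check like this does not judge whether a room feels realistic. It only turns one recurring failure pattern into something the workflow can flag before a person has to notice it.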

The checks should follow the risks users actually feel. A room layout that blocks circulation breaks trust in the tool. A purchase bug touches money. A mistranslated religious term changes meaning. Those failures belong in different risk categories, and the review process should treat them differently.

Failure has to be visible

Silent failure is the part that worries me most. Conventional software can fail silently too, but its failure modes are often easier to localize because the system is more deterministic. AI systems are harder because the output can stay fluent and well-formed even when the operation underneath has failed. The format stays clean, the tone stays calm, and the surface remains convincing.

Before I trust an AI workflow, I want to know how I will notice when it breaks. For the translation plugin, that means logging API calls and costs, inspecting retry behavior, capping spend, and flagging output that still appears to contain English. Detection comes before recovery. If I cannot see the failure, I cannot fix it in time.
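Logging costs and capping spend can live in one small guard object that every API call passes through. This is a minimal Python sketch under assumptions: the class and logger names are hypothetical, and the per-call cost is taken as an estimate supplied by the caller rather than a figure read back from the API.

```python
import logging
from dataclasses import dataclass

logger = logging.getLogger("translation_spend")


class SpendCapExceeded(Exception):
    """Raised when the next call would push total spend past the cap."""


@dataclass
class SpendGuard:
    cap_usd: float
    spent_usd: float = 0.0

    def charge(self, job_id: str, estimated_cost_usd: float) -> None:
        """Record one API call, refusing it if it would exceed the cap."""
        projected = self.spent_usd + estimated_cost_usd
        if projected > self.cap_usd:
            logger.error(
                "cap hit: job=%s projected=%.2f cap=%.2f",
                job_id, projected, self.cap_usd,
            )
            raise SpendCapExceeded(job_id)
        self.spent_usd = projected
        logger.info(
            "call ok: job=%s cost=%.4f total=%.2f",
            job_id, estimated_cost_usd, self.spent_usd,
        )
```

Calling `guard.charge(job_id, cost)` before each request means a runaway loop stops at the cap and leaves a log line behind, instead of continuing until someone reads the invoice.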

For Halo, the equivalent is a layered review path. Static checks catch some problems. Tests catch others. A focused reviewer pass is useful for changes that touch purchase logic, audio behavior, or other high-risk areas. Real-device testing matters because behavior that looks clean on an emulator can diverge from what users actually experience, especially on older Android hardware.

None of this has the thrill of a fast prototype. It is logs, tests, diffs, review queues, CI gates, staging, cost caps, retry inspection, and device checks. It is also the scaffolding that keeps AI output from quietly turning into a liability.

Review should scale with risk

I do not apply the same level of scrutiny to everything. That would make the work slow and miserable. Throwaway copy gets a light pass. A small UI tweak gets checked for fit and obvious nonsense. Anything involving money, user trust, production data, published claims, or a core product flow gets reviewed much more aggressively.

For low-stakes output, I mostly ask whether it looks sane and fits the system. For medium-stakes work, I add questions about whether it solved the real job, respected the constraints, and handled the ugly cases. For high-stakes work, such as RevenueCat purchase code, multilingual publishing, or anything that touches real customers and real money, I want grounding, failure detection, cost awareness, and a maintenance cost I would still accept next quarter.
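The tiers above can be written down as a cumulative checklist so the review depth is decided by the risk level rather than by mood. In this Python sketch the question wording is paraphrased from the text, and treating the high tier as including everything below it is an assumption.

```python
from enum import IntEnum
from typing import Dict, List


class Risk(IntEnum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3


# Review questions paraphrased from the text, grouped by the tier that adds them.
QUESTIONS: Dict[Risk, List[str]] = {
    Risk.LOW: ["looks sane", "fits the system"],
    Risk.MEDIUM: [
        "solved the real job",
        "respected the constraints",
        "handled the ugly cases",
    ],
    Risk.HIGH: [
        "grounded in a source of truth",
        "failure detection in place",
        "cost awareness",
        "acceptable maintenance cost",
    ],
}


def review_checklist(risk: Risk) -> List[str]:
    """Return the cumulative checklist: each tier includes the tiers below it."""
    items: List[str] = []
    for tier in Risk:
        if tier <= risk:
            items.extend(QUESTIONS[tier])
    return items
```

For example, `review_checklist(Risk.MEDIUM)` returns the two low-tier questions followed by the three medium-tier ones.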

That last point matters because a lot of AI output fails quietly after the first pass. The file works, yet it is bloated. The UI is close, yet the structure is wrong. The logic handles the demo, then becomes painful to extend. Short-term speed can create long-term drag if I accept the first working answer too easily.

What changed for me

I still like speed. I still like polish. I still like getting a strong first draft in minutes instead of hours. I trust those signals less than I used to because I have seen how often they arrive before reliability.

The question I ask now is simple: what would convince me this output deserves to ship? That question moves the work away from isolated prompts and toward systems: acceptance criteria, source grounding, project context, automated checks, review paths, cost controls, and real-world testing. The model still matters, but the workflow around the model is where trust is earned.

A first draft is easier to get than it used to be. The harder work is building enough scaffolding around my judgment that I can repeat it when I am tired, moving quickly, or staring at output that looks more finished than it really is.
