Beyond the prompt: the "smell" test and the hidden cost of verification
How the hidden hurdle to scaling vibe coding is still the human nose for error
Imagine this scenario: you are racing a month-end deadline, you paste a three-hundred-line Python script into Claude, and you ask it to tidy the data-wrangling loops and sort out the joins. Thirty seconds later the model returns a neat rewrite, the unit tests stay green, and the BI dashboard shows no alarms. A teammate scans the totals and murmurs, "umm, the numbers look light." Ten minutes of digging uncovers the culprit: deep in one aggregation, the model has swapped a SUM for an AVG, trimming every cohort's revenue by nearly eighty percent. The static linter is pleased, the pipeline stays green, and without that quick human sniff test the mistake would slide straight into the board deck.
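To make the failure mode concrete, here is a minimal pandas sketch of that kind of silent regression; the table and column names are invented for illustration, and the point is simply that the swapped aggregation keeps the same shape and dtypes, so nothing mechanical complains.

```python
import pandas as pd

# Toy revenue table; cohort and revenue names are illustrative.
orders = pd.DataFrame({
    "cohort": ["2024-01", "2024-01", "2024-02", "2024-02", "2024-02"],
    "revenue": [120.0, 80.0, 200.0, 150.0, 50.0],
})

# Intended aggregation: total revenue per cohort.
intended = orders.groupby("cohort")["revenue"].sum()

# The "tidied" rewrite silently swaps sum for mean (SUM -> AVG).
rewritten = orders.groupby("cohort")["revenue"].mean()

# Both results have the same index, shape, and dtype, so schema checks and
# most unit tests stay green; only the magnitudes are wrong.
print(intended)   # 2024-01: 200.0, 2024-02: 400.0
print(rewritten)  # 2024-01: 100.0, 2024-02: 133.3...
```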
I am sure we've all encountered some form of this workflow and gone hard at the keyboard: "I said replace em dashes", "Please don't use any article older than 3 months for this research", "Why can't you write this simple code without an issue", often with a swear word or two thrown in, not only to feel in control, but also to remind the machine that since you're paying $25 a month and Twitter says the latest version beat some benchmark, it should operate as intended, especially for a task as simple as the one at hand.
That quiet moment captures the tension Balaji distills when he contrasts typing prompts with verifying answers. Typing costs nothing beyond curiosity; verification costs time, nuance, and often domain-specific instrumentation.
Terence Tao explains the same issue with his eye test versus smell test analogy. The eye test checks surface polish, whereas the smell test taps an intuitive sensor built from years of living with the data, the code, or the proofs. (Watch the full interview; it's excellent.)
It's an oversimplification, but at heart a large language model is a conditional probability machine: given tokens t₁…tₖ it produces a distribution over tₖ₊₁, an operation repeated until the stop signal fires. The model has no internal table of verified facts, only a multidimensional map that associates contexts with likely continuations. When the distribution flattens because the prompt pushes beyond the support of the training corpus, the sampler must still commit to a token; the gap between the prompt and the training data is bridged with the most statistically compatible guess. In practice, that guess can be a phantom citation, a fabricated function name, or a proof step that skips three lemmas and declares victory. The literature labels the phenomenon hallucination, but the term is slightly misleading: the model does not see a false world, it simply predicts the next token under uncertainty in the only way it knows.
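If it helps to see that loop rather than read about it, here is a toy sketch of autoregressive sampling; the vocabulary and the near-uniform distribution are made up, standing in for what a real model produces when a prompt falls outside its training support.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["the", "revenue", "grew", "shrank", "citation", "<stop>"]

def next_token_distribution(context):
    """Stand-in for a language model: returns P(t_{k+1} | t_1..t_k).

    Real models compute this from learned weights; here we fake an almost
    flat distribution to mimic an out-of-distribution prompt."""
    logits = rng.normal(0.0, 0.1, size=len(vocab))  # almost no signal
    return np.exp(logits) / np.exp(logits).sum()

context = ["the", "revenue"]
while context[-1] != "<stop>" and len(context) < 10:
    probs = next_token_distribution(context)
    # Even when probs is nearly flat, the sampler must pick *something*.
    context.append(rng.choice(vocab, p=probs))

print(" ".join(context))
```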
The confidence that accompanies the fabrication is baked in as well. During supervised fine‑tuning the loss function punishes low log‑likelihood on reference tokens, so the network learns to assign high probability mass to fluent sequences. At inference, the sampler often operates in greedy or top‑p mode, concentrating on the upper rim of that distribution. The result is prose that sounds certain even when the epistemic footing is weak. You can clip the temperature or add nucleus randomness, but the tone remains smooth because the density of the manifold itself is smooth.
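For the curious, this is roughly what nucleus (top-p) filtering does to a distribution before sampling; the probabilities below are illustrative, not taken from any real model.

```python
import numpy as np

def top_p_filter(probs, p=0.85):
    """Keep the smallest set of tokens whose cumulative mass reaches p,
    then renormalize; everything outside the nucleus gets zero probability."""
    order = np.argsort(probs)[::-1]              # tokens, most likely first
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, p) + 1  # how many tokens to keep
    kept = order[:cutoff]
    filtered = np.zeros_like(probs)
    filtered[kept] = probs[kept]
    return filtered / filtered.sum()

# A moderately uncertain distribution over six tokens (illustrative numbers).
probs = np.array([0.35, 0.25, 0.18, 0.12, 0.06, 0.04])
print(top_p_filter(probs))
# The least likely tokens fall outside the nucleus and are zeroed; the rest
# are renormalized, so the sampled text stays fluent even when the model is unsure.
```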
Researchers have been witnessing and measuring these failures across domains. In mathematics, the Lean‑Dojo benchmark surfaces transformer-generated proofs that look impeccable yet omit indispensable sub‑lemmas. In software security audits, tools like CodeQL find synthetic package names slipped into import statements that parse cleanly but resolve to nothing at install or run time. In scientific writing, several ACM and other papers report that a large percentage of citations produced by unrestricted generation point to non‑existent papers. The pattern repeats because the underlying mechanism repeats: every time the model ventures outside the core of its training distribution, it continues the sentence anyway.
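One cheap, mechanical screen for the fabricated-import case looks something like the sketch below; this is not any of the tools mentioned above, just an illustration of checking that the module names in generated code actually resolve in your environment, and the second package name in the demo is deliberately made up.

```python
import ast
import importlib.util

def unresolved_imports(source: str) -> set[str]:
    """Parse generated Python source and return top-level module names that
    cannot be resolved locally, a first screen for fabricated package names."""
    missing = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            names = [node.module]
        else:
            continue
        for name in names:
            root = name.split(".")[0]               # check only the top-level package
            if importlib.util.find_spec(root) is None:
                missing.add(root)
    return missing

generated = "import pandas\nimport fastquery_utils  # plausible-sounding, invented\n"
print(unresolved_imports(generated))  # e.g. {'fastquery_utils'}
```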
The teams I work with also keep a similar log: we have seen models invent Salesforce object names, mislabel ISO date offsets, misapply SQL dialect differences across Presto/Trino/Dremio, and hallucinate rate‑limits on APIs that never existed. Each error on its own is small, but the verification cost compounds.
GenAI feels miraculous because it collapses drafting latency - a marketing email, a legal rider, or a database schema arrives in seconds. That miracle ends when the verification cycle starts. Engineering rituals that once belonged to the author now move to the reviewer; integration tests, redlined citations, and adversarial QA pile up. In analytics pipelines we find that every minute saved at generation returns as three minutes spent on validation queries and anomaly plots.
Things are getting much better with newer models featuring advanced reasoning and deeper tuning, which keep setting new coding and reasoning benchmarks, but the narrative inside organisations stays similar: teams report a predictable trajectory. In the first sprint productivity leaps because the backlog is draft-heavy. By the fourth sprint the ratio between creation and audit normalizes, sometimes inverting when downstream consumers tighten controls. The tool has not degraded; the entropy introduced by fast drafts has simply surfaced.
The bottleneck is being tackled on multiple fronts. There is ongoing work on neural-symbolic loops that tie a transformer to a proof assistant such as Coq: the model proposes a step, the checker accepts or rejects it, and the gradient shifts toward proofs that survive formal scrutiny. Then there are execution sandboxes that compile every generated function, run unit and property tests, capture traces, and feed the failures back to the generator until all tests pass. Plenty of interesting work is also happening on retrieval-augmented decoding, which pulls vetted passages from a curated index before token sampling and stores hash links that let readers click through to the canonical source, a trick that has already cut fabricated facts by double digits in long-published toolkits such as Phi-3 and LlamaGuard. Finally, uncertainty and critic layers quantify variance over multiple forward passes and flag the high-entropy spans that merit human review. Each method spends extra compute up front, yet replaces ad-hoc eyeballing with systematic, architecture-level checks.
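A bare-bones version of that sandbox loop might look like the sketch below; it assumes pytest is installed, and `draft_function` is a stand-in for whatever model call you actually use (here it returns a canned buggy draft and then a fix, purely so the loop runs end to end).

```python
import subprocess
import tempfile
import textwrap
from pathlib import Path

def run_in_sandbox(candidate_code: str, test_code: str) -> str | None:
    """Write the candidate and its tests to a temp dir, run pytest in a
    subprocess, and return the failure output (None if all tests pass)."""
    with tempfile.TemporaryDirectory() as tmp:
        Path(tmp, "candidate.py").write_text(candidate_code)
        Path(tmp, "test_candidate.py").write_text(test_code)
        result = subprocess.run(
            ["python", "-m", "pytest", "-q", tmp],
            capture_output=True, text=True, timeout=60,
        )
        return None if result.returncode == 0 else result.stdout + result.stderr

def draft_function(prompt: str, feedback: str | None = None) -> str:
    """Stand-in for a real model call: returns a buggy draft first (AVG
    instead of SUM) and only fixes it once test feedback is supplied."""
    if feedback is None:
        return textwrap.dedent("""
            def monthly_totals(rows):
                buckets = {}
                for month, amount in rows:
                    buckets.setdefault(month, []).append(amount)
                return {m: sum(v) / len(v) for m, v in buckets.items()}  # oops: AVG
        """)
    return textwrap.dedent("""
        def monthly_totals(rows):
            totals = {}
            for month, amount in rows:
                totals[month] = totals.get(month, 0.0) + amount
            return totals
    """)

tests = textwrap.dedent("""
    from candidate import monthly_totals
    def test_sums_not_averages():
        rows = [("2024-01", 120.0), ("2024-01", 80.0)]
        assert monthly_totals(rows)["2024-01"] == 200.0
""")

feedback, attempt = None, 0
while attempt < 3:
    code = draft_function("Aggregate revenue by month", feedback)
    feedback = run_in_sandbox(code, tests)
    if feedback is None:
        break  # tests pass; hand off to a human for the final smell test
    attempt += 1
```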
That last bit is what most enterprise use cases are leveraging in some form today (and FWIW, it is also the foundation I use across most of my builds): critic models, AI systems trained explicitly to interrogate and validate other models' outputs and to automate the preliminary layers of verification. Net net, one model drafts, a second adversary searches for contradictions, a third arbitrates, and so on. This approach, however, risks circular reasoning. Without external grounding in independently verified data or execution contexts, critic models may amplify rather than mitigate hallucinations. The technique also yields diminishing returns if the agents share weights, so effective systems randomize initialization or training perspectives. Importantly, the arbiter must defer to external ground truth, otherwise the debate becomes a closed circuit.
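In skeleton form the pattern looks something like this; `call_model` and `ground_truth_check` are placeholders for whatever LLM API and external verifier you plug in, and the one non-negotiable detail is that the exit condition lives outside the models.

```python
from typing import Callable

def debate_and_verify(
    task: str,
    call_model: Callable[[str, str], str],        # (role prompt, content) -> text
    ground_truth_check: Callable[[str], bool],    # runs SQL, executes tests, hits a source of record
    max_rounds: int = 3,
) -> str | None:
    """One model drafts, a second critiques, a third arbitrates; the loop only
    terminates successfully when an *external* check passes, so the debate
    cannot close in on a shared hallucination."""
    draft = call_model("You are the drafter. Produce an answer.", task)
    for _ in range(max_rounds):
        critique = call_model(
            "You are the critic. List concrete contradictions or unsupported claims.",
            f"Task:\n{task}\n\nDraft:\n{draft}",
        )
        verdict = call_model(
            "You are the arbiter. Revise the draft using only critiques you can justify.",
            f"Task:\n{task}\n\nDraft:\n{draft}\n\nCritique:\n{critique}",
        )
        if ground_truth_check(verdict):   # external grounding, not model consensus
            return verdict
        draft = verdict                   # otherwise iterate on the revised draft
    return None                           # escalate to a human reviewer
```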
Despite all these issues and the ongoing research to solve them, a persistent human-in-the-loop framework remains the single most reliable way to manage and reduce risk, particularly in high-stakes domains involving customer-level solutioning and reasoning; the risks are most acute in industries such as healthcare, finance, and law. For the verification that matters most, human expertise is still paramount and essential to contextual grounding, bridging the gap between AI-generated plausibility and genuine trustworthiness.
The deepest leverage to solve this verification-at-scale problem belongs to the organizations that own the pre‑training stack. Only they can modify the loss to reward verifiable output directly, embed symbolic reasoning layers during training, expose intermediate attention maps for calibration, or fine‑tune on datasets where correctness is measured by formal acceptance rather than by human preference. External validators can audit and provide domain-specific verification (there are tons of repos on Git, and yes, quite a number of startups are building domain-specific stacks that provide a factual and symbolic reference layer for comparison, and are making money off it), but they cannot re‑shape the inductive bias at the root. It is easier for an organisation that already owns distribution to add verification than for a verification-only platform, however strong, to capture the inference stack while also grabbing distribution. The economic incentives align as well: deployment stalls when the inference-owning enterprises realize that each hallucination carries legal or reputational liability. Providers that solve verification at the architectural level will unlock the next adoption wave. Trust, like latency, is a platform metric.
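To be concrete about what "modify the loss" could mean, a toy version might bolt a verification term onto the usual next-token objective, something like the sketch below; real pre-training objectives are far more involved, and `verifier_score` is an assumed external signal (a proof checker, a test harness, a retrieval match), not part of any shipped training recipe.

```python
import torch
import torch.nn.functional as F

def verification_weighted_loss(logits, targets, verifier_score, lam=0.5):
    """Illustrative only: standard next-token cross-entropy plus a penalty
    proportional to (1 - verifier_score), where verifier_score in [0, 1]
    comes from an external checker. This just shows where a verification
    signal could enter the objective, not how any vendor actually trains."""
    ce = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
    return ce + lam * (1.0 - verifier_score)
```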
The lure of instant fluency is real, yet intelligence is more than the absence of grammatical error; IMO, it is fidelity to reality and a willingness to expose uncertainty. Until machines can offer that fidelity natively, humans will keep a hand on the circuit breaker. And from where I see things in practice, the role shift is healthy: we move from clerical drafting to higher-order judgment, supervising an eager assistant that is brilliant, tireless, and occasionally delusional. It'll sound cliché, but symbiosis does beat substitution: we prompt, the model drafts, our tests verify, the model learns the correction, and the loop tightens. Productivity rises again, this time on a foundation of trust, not wishful acceptance. In that future the smell test will still matter, but it becomes a final flourish instead of the last line of defense.