The Wrong Question, Answered Perfectly

Three Failure Modes, Not Two

Most discussion of AI failure in legal and medical contexts focuses on two output errors: hallucination (fabrication—the tool invents content with no basis in the source) and confabulation (distortion—the tool misrepresents real content). Both are addressed in detail in Part 1 and Part 2 of this series. Both share a critical quality: they are output errors, and they are detectable by someone who verifies the output against the source.

The third failure mode is categorically different. It is an input error, not an output error—and it produces output that is factually accurate, properly sourced, and impossible to detect as wrong without independent knowledge of what the analysis was supposed to accomplish.

Output Error

Hallucination—fabrication

The tool invents content that does not exist in the source material. Citations that do not exist. Facts not in the record. Findings with no clinical basis.

Catchable by: verifying output against the source document.

Output Error

Confabulation—distortion

The tool takes real content and misrepresents it—blending accurate and inaccurate elements, presenting partial information as complete, filling gaps with plausible-sounding content the record does not support.

Catchable by: careful verification of claims against the original source, noting what the record does and does not say.

Input Error

Prompt-based downstream error—task misspecification

The tool is given the wrong task and executes it correctly. The citations are real. The quotes are accurate. The document looks thorough and professional. The error lives entirely in the framing—in what the tool was asked to do before it touched the material.

Catchable by: someone who independently understands the theory of the case—which somewhat defeats the purpose of using the tool in the first place.

What Task Misspecification Looks Like in Practice

Consider a products liability case involving a failed medical implant. The case has two potential theories: manufacturing defect—this specific unit deviated from the manufacturer's own specifications at the time it was made—and design defect—the product was built exactly as designed, but the design itself was unreasonably dangerous. These are distinct legal theories with distinct elements, distinct discovery targets, and distinct deponents. A manufacturing defect case focuses on the production process, quality control, deviation from specification, and what happened in the factory. A design defect case focuses on the choices made long before manufacturing began—the engineering decisions, the risk-benefit analysis, the alternative designs considered and rejected, the regulatory submissions.

An AI tool is asked to prepare deposition questions for the manufacturer's corporate representative. The prompt does not specify which theory the case is being pursued under. The tool is given the case materials—the complaint, medical records, and discovery responses—and asked to generate deposition questions.

The tool returns a thorough, well-organized document. The questions are detailed, properly sourced to the case materials, and formatted like professional deposition preparation work product. The questions focus on the production line—batch records, inspection logs, deviation reports, quality control procedures, the chain of custody for this specific unit from manufacture through packaging and delivery.

None of this is wrong in a factual sense. These are legitimate questions, and they would be exactly right in a manufacturing defect case. This case, however, was being pursued on a design defect theory. The associate who ran the AI tool had no way to recognize the misalignment without independent knowledge of the theory driving the case: the document looked complete, the questions were grounded in the case materials, and it went up the chain as finished deposition prep.

The senior attorney who received it had been working the case for months, knew the theory, and understood exactly what the deposition needed to accomplish. That attorney read through it and immediately recognized that every question in the document was aimed at the wrong liability theory. The design defect issues that drove the case were untouched. The prep was worthless. And the problem was not that the AI made something up. It was that no one upstream of the tool had told it what the case was actually about.

Both theories involve the same device. Both involve the same company. A document organized around one looks almost exactly like a document organized around the other—until you try to use it at trial and realize the record you built does not support the theory you are arguing.

That is what prompt-based downstream error looks like in practice. It is not a hallucinated specification. It is not a misrepresented test result. It is a well-executed answer to a question that was never the right question—produced by a tool that had no way to know the difference and no basis for flagging its own misalignment. The tool did exactly what it was asked. That was the problem.
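One way to keep this kind of error from reaching the tool is to make the liability theory a required input rather than something the prompt may or may not mention. What follows is a minimal sketch in Python, not a description of any particular product. The names `LiabilityTheory` and `build_deposition_prompt`, and the prompt wording itself, are hypothetical and chosen only for illustration.

```python
from enum import Enum


class LiabilityTheory(Enum):
    """Hypothetical labels for the two theories discussed above."""
    MANUFACTURING_DEFECT = "manufacturing defect"
    DESIGN_DEFECT = "design defect"


def build_deposition_prompt(theory: LiabilityTheory, deponent: str) -> str:
    """Return prompt text that states the theory explicitly.

    Raises TypeError if the caller has not chosen a theory, so an
    unspecified theory stops the request instead of being guessed.
    """
    if not isinstance(theory, LiabilityTheory):
        raise TypeError("A liability theory must be chosen before prompting.")
    return (
        f"Prepare deposition questions for {deponent}. "
        f"The case is being pursued on a {theory.value} theory. "
        f"Focus only on questions that develop that theory."
    )


# Illustrative call: the theory is named before the tool sees any material.
prompt = build_deposition_prompt(
    LiabilityTheory.DESIGN_DEFECT,
    "the manufacturer's corporate representative",
)
print(prompt)
```

A guard like this only forces someone to name a theory. Whether the named theory is the right one still depends on the person who knows the case.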

Why It Is the Most Dangerous of the Three

Hallucination, at its worst, produces a fabricated case citation that opposing counsel catches at argument, or a clinical finding that does not appear anywhere in the record and fails on the first attempt at verification. These failures are embarrassing and potentially sanctionable, but they are detectable—the error is visible to anyone who checks the source.

Confabulation is more dangerous because it is partly correct. The case exists, but the holding is mischaracterized. The finding is real, but the significance is overstated. Detection requires careful comparison—not just confirming that the source exists, but confirming that what the output claims about it is accurate. Still, a rigorous verification process catches it.

Prompt-based downstream error requires something that verification cannot provide: independent knowledge of what the analysis was supposed to accomplish. An attorney who receives a well-formatted, thoroughly cited deposition preparation document organized around the wrong theory cannot detect the error by checking the citations. The citations check out. The error is not in what the document says—it is in what the document was built to do. Detecting it requires knowing, before reading the document, what the deposition needed to accomplish. Which means the attorney must have done the strategic analysis the tool was supposed to assist with.

This is the failure mode that is most likely to go undetected in a time-pressured litigation environment. A thorough-looking document on a tight timeline is difficult to question. The work product looks complete. The citations look right. And the attorney who trusts it is not just missing the right questions—they may be signaling the wrong theory to a witness who is now prepared to answer it.

The Full Taxonomy of Input Errors

Task misspecification is one of several categories of input error that produce accurate but misleading output. Understanding the full range is useful for evaluating where AI-assisted analysis is being deployed and what kind of oversight is required at each stage.

Task misspecification

The tool is given the wrong task and executes it correctly. The output is accurate and responsive—to a question that was not the right question. Nothing in the output signals the misalignment.

Requires: independent understanding of case theory upstream of the tool.

Incomplete context

The tool is given the right task but insufficient material to complete it well. It fills the gaps with general knowledge, prior patterns, or plausible assumptions rather than flagging that it lacks what it needs. The output looks complete but is built partly on inference.

Requires: knowing what the complete record set should contain—and what is missing from it.

Context contamination

The tool is given the right task and sufficient material, but the material itself contains errors, biases, or framings that propagate forward. If the document uploaded contains a mischaracterization, the tool may absorb and reproduce it rather than interrogating it.

Requires: clinical knowledge to recognize when the source material's framing is itself the problem.

Role misspecification

The tool is given the right task and right material but from the wrong perspective. Asked to analyze a transcript without being told which party the attorney represents, or without the adversarial lens that a deposition prep document requires, the output may be technically responsive but strategically misaligned.

Requires: explicitly establishing perspective, adversarial stance, and strategic objective before the tool engages with the material.

Scope underspecification

The tool is given the right task, right material, and right perspective, but the boundaries are too vague. The tool either over-generates in irrelevant directions or under-generates by staying too narrow. Both produce output that misses what the assignment required.

Requires: precise scoping of what the output must cover and what it must exclude—including what prior ground has already been covered.

Ordering and emphasis errors

The tool weights the material incorrectly, surfacing minor points prominently and burying major ones. This is particularly common when the most important content is not in the most prominent location—buried in a footnote, in a brief exchange late in a long transcript, or in an amendment to earlier testimony.

Requires: clinical judgment to recognize which findings actually matter—regardless of where they appear in the document.

Audience misspecification

The tool generates output calibrated for the wrong reader—framed for a generalist when the reader is a subject matter expert, or written at a depth that does not match what the document will be used for.

Requires: specifying the intended reader and intended use before the tool generates output.
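The "Requires" lines above can be read as a checklist to complete before any prompt is written. The sketch below expresses that idea in Python under stated assumptions: `ReviewRequest`, its field names, and `unmet_requirements` are hypothetical, and the mapping of one field to each category is an illustration, not an established method.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ReviewRequest:
    """Hypothetical pre-prompt checklist; each field maps to one category above."""
    task: str = ""                      # task misspecification
    records_expected: List[str] = field(default_factory=list)  # incomplete context
    records_provided: List[str] = field(default_factory=list)
    sources_vetted: bool = False        # context contamination
    role: str = ""                      # role misspecification
    scope: str = ""                     # scope underspecification
    priorities: str = ""                # ordering and emphasis errors
    audience: str = ""                  # audience misspecification


def unmet_requirements(req: ReviewRequest) -> List[str]:
    """Return the input-error categories this request has not yet addressed."""
    gaps = []
    if not req.task.strip():
        gaps.append("task misspecification")
    # An empty expected list means nobody has said what the full record is.
    missing = set(req.records_expected) - set(req.records_provided)
    if not req.records_expected or missing:
        gaps.append("incomplete context")
    if not req.sources_vetted:
        gaps.append("context contamination")
    if not req.role.strip():
        gaps.append("role misspecification")
    if not req.scope.strip():
        gaps.append("scope underspecification")
    if not req.priorities.strip():
        gaps.append("ordering and emphasis errors")
    if not req.audience.strip():
        gaps.append("audience misspecification")
    return gaps


# Illustrative request: task and role are stated, one record source is missing.
request = ReviewRequest(
    task="Evaluate the records for physician failure to diagnose",
    records_expected=["Facility A admissions", "Facility B admissions"],
    records_provided=["Facility A admissions"],
    role="Reviewing for plaintiff's counsel",
)
print(unmet_requirements(request))
```

A checklist of this kind can show that a field was left blank. It cannot show that a field was filled in correctly, which is why each category still depends on the reviewer's own knowledge of the case and the record.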

When All Three Fail at Once

Each failure mode described above is serious on its own. The more alarming scenario, and the one most likely to occur in practice, is when they compound. A single AI-generated work product can fail in three ways at once, mixing input errors with output errors, and none of the three failures will announce itself in the document.

Consider what that looks like in a single medical record review. The record set provided to the tool is missing three hospitalizations that occurred at a different facility—a gap the tool does not flag because it has no way to know what is absent. That is incomplete context. The tool processes the records it has and, because the decedent was young and the pattern of care looked consistent with a particular diagnosis, fills in a clinical picture that is partly inferred rather than documented. That is confabulation. And the prompt asks the tool to evaluate the case for nursing negligence, when the actual theory—the one the records support—is physician failure to diagnose. The tool produces a thorough nursing-focused analysis of a case that is actually about a physician. That is task misspecification.

The document that comes back looks complete. The clinical findings are cited. The nursing entries are real. The analysis is well-structured. There is no hallucinated fact that fails on first verification, no fabricated citation that opposing counsel catches at argument. Instead there are three simultaneous failures: an incomplete record set, distorted output, and the wrong analytical framework. Together they produce a document that is wrong in ways that reinforce each other, and the errors are visible only to someone who already knows what the complete record contains and what theory it supports.

This is the scenario that makes AI oversight not optional but essential at every stage. Not just at the output—verifying citations, checking findings against the record. At the input—ensuring the record set is complete, the theory is correctly specified, the adversarial lens is applied, and the scope is right. And at the interpretive layer—a physician who has read the underlying records, understands what is missing, and can recognize when the output's confidence is not matched by the material's completeness.

Any one of these failure modes, undetected, can harm a case. All three together, in the same document, in a time-pressured environment where the work product looks complete and the deadline is tomorrow, can be unrecoverable.

The Common Thread

Every input error category shares the same quality: it produces output that looks complete, professional, and responsive. Unlike hallucination, none of them announces itself. They all require the human reviewer to bring something the tool does not have: knowledge of what is missing, awareness of the theory, understanding of what the material actually contains, or judgment about what actually matters.

That is precisely the kind of oversight that gets skipped when the output looks good and time is short. A well-formatted, thoroughly cited AI document in a time-pressured environment does not invite scrutiny. It invites use.

The physician who directs AI-assisted medical record review brings exactly the kind of upstream judgment that prevents these errors from propagating. The theory of the case shapes what the tool is asked to find. The clinical context shapes how the findings are weighted. The record is read the way a physician reads a chart—not as a document to be processed, but as a set of clinical decisions to be interrogated. That framing, applied before the AI touches the material, is what separates useful analysis from a polished document organized around the wrong question.

Physician-directed AI record review.

AI-assisted methodology with physician oversight at every step—from the prompt that shapes what the tool looks for, to the verification of every finding before it reaches you. Case screening delivered within 48–72 hours of completed record receipt. Flat fee. No commitment beyond the initial engagement.
