Three Failure Modes, Not Two
Most discussion of AI failure in legal and medical contexts focuses on two output errors: hallucination (fabrication—the tool invents content with no basis in the source) and confabulation (distortion—the tool misrepresents real content). Both are addressed in detail in Part 1 and Part 2 of this series. Both share a critical quality: they are output errors, and they are detectable by someone who verifies the output against the source.
The third failure mode, task misspecification, is categorically different. It is an input error, not an output error, and it produces output that is factually accurate, properly sourced, and impossible to detect as wrong without independent knowledge of what the analysis was supposed to accomplish.
What Task Misspecification Looks Like in Practice
Consider a products liability case involving a failed medical implant. The case has two potential theories: manufacturing defect—this specific unit deviated from the manufacturer's own specifications at the time it was made—and design defect—the product was built exactly as designed, but the design itself was unreasonably dangerous. These are distinct legal theories with distinct elements, distinct discovery targets, and distinct deponents. A manufacturing defect case focuses on the production process, quality control, deviation from specification, and what happened in the factory. A design defect case focuses on the choices made long before manufacturing began—the engineering decisions, the risk-benefit analysis, the alternative designs considered and rejected, the regulatory submissions.
An AI tool is asked to prepare deposition questions for the manufacturer's corporate representative. The prompt does not specify which theory the case is being pursued under. The tool is given the case materials—the complaint, medical records, and discovery responses—and asked to generate deposition questions.
The tool returns a thorough, well-organized document. The questions are detailed, properly sourced to the case materials, and formatted like professional deposition preparation work product. The questions focus on the production line—batch records, inspection logs, deviation reports, quality control procedures, the chain of custody for this specific unit from manufacture through packaging and delivery.
None of this is wrong in a factual sense. These are legitimate questions, and they would be exactly right in a manufacturing defect case. But this case is being pursued on a design defect theory. The associate who runs the AI tool has no way to recognize the misalignment without independent knowledge of the theory driving the case: the document looks complete, the questions are grounded in the case materials, and it goes up the chain as finished deposition prep. The senior attorney who receives it, the one who has been working the case for months, knows the theory, and understands exactly what the deposition needs to accomplish, reads through it and immediately recognizes that every question in the document is aimed at the wrong liability theory. The design defect issues that drive the case are untouched. The prep is worthless. And the problem is not that the AI made something up. It is that no one upstream of the tool told it what the case was actually about.
That is what task misspecification, a prompt-based downstream error, looks like in practice. It is not a hallucinated specification. It is not a misrepresented test result. It is a well-executed answer to a question that was never the right question, produced by a tool that had no way to know the difference and no basis for flagging its own misalignment. The tool did exactly what it was asked. That was the problem.
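One way to keep the theory of the case from being left out of the prompt is to make it a required input. The following is a minimal sketch, not a description of any particular tool: the names `LiabilityTheory`, `DepositionPrepRequest`, and `build_prompt` are hypothetical, and the prompt wording is an assumption chosen for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class LiabilityTheory(Enum):
    MANUFACTURING_DEFECT = "manufacturing defect"
    DESIGN_DEFECT = "design defect"


@dataclass
class DepositionPrepRequest:
    # All field names are hypothetical, chosen for illustration.
    witness_role: str
    materials: List[str]
    theory: Optional[LiabilityTheory] = None


def build_prompt(request: DepositionPrepRequest) -> str:
    """Build the prompt text, refusing to proceed without a stated theory."""
    if request.theory is None:
        raise ValueError(
            "No liability theory specified; ask the supervising attorney "
            "before generating deposition questions."
        )
    return (
        f"Prepare deposition questions for the {request.witness_role}. "
        f"The case is being pursued on a {request.theory.value} theory; "
        f"focus the questions on that theory. "
        f"Materials provided: {', '.join(request.materials)}."
    )
```

A check like this does not choose the right theory. It only forces the person running the tool to state one, which turns a silent omission into a question for the attorney who knows the case.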
Why It Is the Most Dangerous of the Three
Hallucination, at its worst, produces a fabricated case citation that opposing counsel catches at argument, or a clinical finding that does not appear anywhere in the record and fails on the first attempt at verification. These failures are embarrassing and potentially sanctionable, but they are detectable—the error is visible to anyone who checks the source.
Confabulation is more dangerous because it is partly correct. The case exists, but the holding is mischaracterized. The finding is real, but the significance is overstated. Detection requires careful comparison—not just confirming that the source exists, but confirming that what the output claims about it is accurate. Still, a rigorous verification process catches it.
Detecting prompt-based downstream error requires something that verification cannot provide: independent knowledge of what the analysis was supposed to accomplish. An attorney who receives a well-formatted, thoroughly cited deposition preparation document organized around the wrong theory cannot detect the error by checking the citations. The citations check out. The error is not in what the document says; it is in what the document was built to do. Detecting it requires knowing, before reading the document, what the deposition needed to accomplish, which means the attorney must already have done the strategic analysis the tool was supposed to assist with.
This is the failure mode that is most likely to go undetected in a time-pressured litigation environment. A thorough-looking document on a tight timeline is difficult to question. The work product looks complete. The citations look right. And the attorney who trusts it is not just missing the right questions—they may be signaling the wrong theory to a witness who is now prepared to answer it.
The Full Taxonomy of Input Errors
Task misspecification is one of several categories of input error that produce accurate but misleading output. Understanding the full range is useful for evaluating where AI-assisted analysis is being deployed and what kind of oversight is required at each stage.
When All Three Fail at Once
Each failure mode described above is serious on its own. The more alarming scenario—and the one most likely to occur in practice—is when they compound. A single AI-generated work product can fail in all three ways simultaneously, and none of the three failures will announce itself in the document.
Consider what that looks like in a single medical record review. The record set provided to the tool is missing three hospitalizations that occurred at a different facility—a gap the tool does not flag because it has no way to know what is absent. That is incomplete context. The tool processes the records it has and, because the decedent was young and the pattern of care looked consistent with a particular diagnosis, fills in a clinical picture that is partly inferred rather than documented. That is confabulation. And the prompt asks the tool to evaluate the case for nursing negligence, when the actual theory—the one the records support—is physician failure to diagnose. The tool produces a thorough nursing-focused analysis of a case that is actually about a physician. That is task misspecification.
The document that comes back looks complete. The clinical findings are cited. The nursing entries are real. The analysis is well-structured. There is no hallucinated fact that fails on first verification, no fabricated citation that opposing counsel catches at argument. There are three simultaneous failures—an incomplete record set, distorted output, and the wrong analytical framework—producing a document wrong in ways that reinforce each other, and only visible to someone who already knows what the complete record contains and what theory it supports.
This is the scenario that makes AI oversight not optional but essential at every stage. Not just at the output—verifying citations, checking findings against the record. At the input—ensuring the record set is complete, the theory is correctly specified, the adversarial lens is applied, and the scope is right. And at the interpretive layer—a physician who has read the underlying records, understands what is missing, and can recognize when the output's confidence is not matched by the material's completeness.
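The record-completeness check at the input stage can be sketched in a few lines. This assumes someone outside the tool (intake staff, the attorney, or the reviewing physician) has compiled a list of facilities where the patient was treated, since the tool itself cannot know what is absent. The function name and the facility names in the usage note are hypothetical.

```python
from typing import Iterable, List


def missing_record_sets(
    expected_facilities: Iterable[str],
    received_facilities: Iterable[str],
) -> List[str]:
    """Return expected facilities that have no records in the set provided."""

    def normalize(name: str) -> str:
        # Compare names case-insensitively and ignore extra whitespace.
        return " ".join(name.lower().split())

    received = {normalize(name) for name in received_facilities}
    return [
        name for name in expected_facilities
        if normalize(name) not in received
    ]
```

For example, `missing_record_sets(["General Hospital", "Riverside Medical Center"], ["general  hospital"])` returns `["Riverside Medical Center"]`. The check is only as good as the expected list, which is why it depends on a human who knows the treatment history.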
Any one of these failure modes, undetected, can harm a case. All three together, in the same document, in a time-pressured environment where the work product looks complete and the deadline is tomorrow, can be unrecoverable.
The Common Thread
Every input error category shares the same quality: it produces output that looks complete, professional, and responsive. Unlike hallucination, none of them announces itself. They all require the human reviewer to bring something the tool does not have: knowledge of what is missing, awareness of the theory, understanding of what the material actually contains, or judgment about what actually matters.
That is precisely the kind of oversight that gets skipped when the output looks good and time is short. A well-formatted, thoroughly cited AI document in a time-pressured environment does not invite scrutiny. It invites use.
The physician who directs AI-assisted medical record review brings exactly the kind of upstream judgment that prevents these errors from propagating. The theory of the case shapes what the tool is asked to find. The clinical context shapes how the findings are weighted. The record is read the way a physician reads a chart—not as a document to be processed, but as a set of clinical decisions to be interrogated. That framing, applied before the AI touches the material, is what separates useful analysis from a polished document organized around the wrong question.
Physician-directed AI record review.
AI-assisted methodology with physician oversight at every step—from the prompt that shapes what the tool looks for, to the verification of every finding before it reaches you. Case screening delivered within 48–72 hours of completed record receipt. Flat fee. No commitment beyond the initial engagement.