A doctor writes in a clinical note: “Patient appears more comfortable today, breathing easier, family less anxious.”

This sentence contains clinical information that no structured data field captures. It is not in the lab report. It is not in the vital signs trend. It will not be assigned an ICD code. It does not appear in the medication record. It is exactly the kind of contextual, observational, integrative judgment that distinguishes good clinical documentation from data entry — and it is largely invisible to AI.

That invisibility is not a minor inconvenience. It is a fundamental characteristic of the data landscape in which clinical AI operates. Understanding it — and its implications for how AI tools should be used and evaluated — is one of the most important things a clinician can know about AI in medicine.

The Unstructured Data Problem

Estimates in healthcare informatics consistently suggest that approximately 80% of clinical data is unstructured — existing as free text, medical images, or audio — rather than in the structured fields of electronic health record systems. This figure, widely cited in healthcare AI literature and referenced in IBM and NHS informatics analyses, captures something important about what electronic health records actually contain.

The structured fields in an EHR — vital signs, lab results, diagnostic codes, medication lists, appointment records — are the minority. The clinical note, the radiology report, the discharge letter, the referral correspondence: these are where the actual clinical thinking lives. And for most AI systems, they are opaque.

Natural Language Processing, or NLP, is the technical approach to extracting structured meaning from unstructured text. At its best, NLP can read a radiology report and identify specific findings, their location, and their clinical significance — converting a free-text paragraph into structured data that an AI model can use. Several clinical NLP systems have been validated at performance levels that compare well with human coding for specific, consistent text types.

But the boundary of what NLP handles well is important to understand. Radiology reports follow relatively predictable formats. A “Findings” section followed by an “Impression” section, with domain-specific vocabulary, allows NLP models to extract information reliably. Clinical notes written by individual clinicians are a different challenge entirely.

Clinical notes contain highly idiosyncratic abbreviations. Different clinicians use different shorthand, and the same abbreviation can mean different things in different contexts or specialties. Negation is a particular challenge: “no chest pain” must be identified as the absence of chest pain, not its presence — a distinction that is trivial for a human reader and surprisingly difficult for automated text analysis. Context-dependent meaning is another: a finding that would be significant in one clinical situation may be incidental in another, and the note’s meaning depends on understanding the patient’s broader situation, which may span multiple previous entries.
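The negation problem can be made concrete with a minimal sketch, loosely inspired by NegEx-style trigger matching. This is an illustration, not a validated clinical NLP tool; the trigger list and the five-token window are assumptions chosen for the example:

```python
import re

# Simplified negation triggers. Real clinical NLP systems use far
# larger, validated trigger lists and handle negation scope carefully.
NEGATION_TRIGGERS = {"no", "denies", "without"}

def is_negated(text: str, finding: str, window: int = 5) -> bool:
    """Return True if `finding` appears within `window` tokens
    after a negation trigger in `text`."""
    tokens = re.findall(r"[a-z]+", text.lower())
    finding_tokens = finding.lower().split()
    for i in range(len(tokens) - len(finding_tokens) + 1):
        if tokens[i:i + len(finding_tokens)] == finding_tokens:
            # Look back up to `window` tokens for a negation trigger.
            preceding = tokens[max(0, i - window):i]
            if any(t in NEGATION_TRIGGERS for t in preceding):
                return True
    return False

print(is_negated("Patient reports no chest pain", "chest pain"))  # True
print(is_negated("Patient reports chest pain", "chest pain"))     # False
```

Even this toy version hints at the failure modes: a sentence like "no history of chest pain, but now reports chest pain" defeats a fixed window, which is why production systems need scope detection rather than simple proximity rules.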

The richest clinical reasoning — the paragraphs where a consultant integrates weeks of hospital findings, explains a complex diagnostic picture, and articulates a nuanced management plan — is also the hardest for NLP to interpret reliably. The irony is almost perfect: the more sophisticated the clinical thinking, the harder it is for AI to read.

[Figure: Most clinical information is unstructured — and most AI systems can only read the minority that isn't. Approximately 80 percent of clinical information exists as unstructured text (notes, letters, summaries), while structured EHR fields capture only lab values, vitals, and billing codes.]

Why Missing Data Is Not Random

Beyond what is recorded in unstructured form, there is a more fundamental problem: the clinical information that is never recorded at all.

The instinctive assumption is that missing data in medical records is random noise — gaps that could have gone either way, distributed without pattern. This assumption is wrong, and understanding why it is wrong is one of the most important insights in clinical AI literacy.

Data is missing from medical records for reasons that are systematically related to health outcomes. This is not a coincidence. It is a structural feature of how clinical data is generated.

Consider patients who do not attend follow-up appointments. Their outcomes after a clinical episode are absent from the hospital record — not because nothing happened, but because they were not captured. These patients are not a random subset of the clinical population. They are more likely to have transportation difficulties, work obligations that conflict with appointment times, difficulty navigating health system bureaucracy, or levels of health anxiety that make follow-up feel threatening rather than helpful. These are also factors associated with worse health outcomes. The absence of their data is not random — it correlates with the very outcomes the AI is trying to predict.
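A small simulation illustrates why this matters. All probabilities below are invented for the example, not clinical estimates; the point is only the mechanism, that when the same factor drives both poor outcomes and missing follow-up, the recorded data understates the true outcome rate:

```python
import random

random.seed(0)

# Simulate 10,000 patients: higher-risk patients are more likely both
# to have a poor outcome and to miss follow-up (illustrative numbers).
population, recorded = [], []
for _ in range(10_000):
    high_risk = random.random() < 0.3
    poor_outcome = random.random() < (0.40 if high_risk else 0.10)
    attends_followup = random.random() < (0.50 if high_risk else 0.90)
    population.append(poor_outcome)
    if attends_followup:
        recorded.append(poor_outcome)  # only attendees reach the record

true_rate = sum(population) / len(population)
observed_rate = sum(recorded) / len(recorded)
print(f"True poor-outcome rate:     {true_rate:.1%}")
print(f"Rate visible in the record: {observed_rate:.1%}")
```

The observed rate comes out several percentage points below the true rate, and a model trained on the record alone would learn that optimistic picture as if it were reality.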

Consider tests that are not ordered. The absence of an echocardiogram result in a patient’s record does not mean the patient had no cardiac abnormalities. It may mean that the clinician assessed the clinical picture and judged an echo unnecessary. That clinical judgment — “I didn’t order this because the presentation was clearly non-cardiac” — is information. It is not captured in any structured field. An AI that treats an absent echo result as missing data is missing the clinical reasoning behind the absence.

Consider social determinants of health. Income, housing stability, food security, occupational exposure, social support: these factors have well-documented effects on health outcomes. A large body of research establishes them as among the strongest predictors of readmission, disease progression, and treatment adherence. They are almost never systematically documented in electronic health records. Most EHR systems have no dedicated fields for them, and even where they exist, documentation is inconsistent. An AI trained on EHR data to predict readmission is predicting outcomes while blind to some of the most powerful predictors of those outcomes.

This is not a data quality problem that can be fixed with better record-keeping, though better record-keeping would help. It reflects the fact that clinical data is generated as a by-product of clinical care, not as a systematic research protocol. What gets recorded depends on what the clinical workflow requires, what fields the EHR provides, what clinicians have time to document, and what is reimbursed. This is a fundamentally different process from the prospective data collection of a clinical trial — and clinical AI inherits all its structure and all its gaps.

[Figure: Missing clinical data is not random — it reflects real patterns that AI systems inherit and can amplify. Patients who miss follow-up, tests not ordered because the clinical picture is clear, and undocumented social determinants of health all create systematic gaps.]

What This Means for AI Tools in Practice

The practical implications of unstructured and missing data for clinical AI are significant and specific.

An AI readmission prediction model trained on hospital EHR data will perform best on patients whose data looks like the training data: patients who were frequently in contact with the health system, whose records are comprehensive, whose diagnoses were systematically coded, and whose social context happens to match the training population. For these patients, the AI has rich information and a relevant pattern-matching reference.

For patients who are less well documented — those who present infrequently, whose notes are brief, whose comorbidities are undercoded, or whose demographic profile was underrepresented in the training data — the AI has less information and fewer relevant patterns to draw on. Its predictions for these patients rest on thinner foundations, but the model may not indicate this. It produces a risk score in the same format for every patient, regardless of how much information it actually has.

This asymmetry has a specific clinical implication: the patients for whom AI risk scores are least reliable are often the patients with the highest actual clinical complexity and the most to gain from accurate risk stratification. The model is most confident where it has the richest data — which is also where clinical intuition is often strongest anyway. It is most uncertain where data is sparse — which is where it might seem most useful.

Understanding this pattern does not mean dismissing AI tools. It means applying them with appropriate calibration: more weight where the tool has rich, representative data, more caution where the clinical picture is complex and the documentation incomplete.

The GIGO Principle in Clinical AI

Computer science has a phrase that applies here with uncomfortable precision: Garbage In, Garbage Out. The quality of an AI model’s outputs is bounded by the quality of its training data. This is not merely a slogan — it is a description of a mathematical constraint. A model cannot learn reliable patterns from unreliable inputs.

In clinical AI, the GIGO principle has several specific manifestations.

ICD diagnostic codes are assigned by clinical coders, not clinicians. They reflect the documentation in the medical record filtered through a coder’s interpretation of that documentation and the requirements of the billing system. Studies of coding accuracy consistently find discrepancies between the coded diagnosis and the clinical reality. A diagnosis that was listed as “possible” in the clinical note may be coded as definitive. A secondary condition that was clinically significant may not be coded at all if the primary diagnosis exhausted the available coding fields. An AI trained on ICD data is learning from this filtered, imperfect representation.

Vital signs recorded at triage or at scheduled nursing observations represent a single timepoint in a continuous physiological process. The vital signs in the record may not reflect the patient's trajectory between observations. A patient who was tachycardic for three hours between nursing checks and then normalised will appear to have a normal heart rate in the record. An AI that relies on documented vital signs is working from a sampled representation of a continuous reality.
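The sampling problem can be shown in a few lines. The trajectory and observation schedule below are invented for illustration:

```python
# Illustrative only: a continuous heart-rate trajectory sampled at
# 4-hourly nursing observations misses a 3-hour tachycardic episode.
def heart_rate(minutes: int) -> int:
    # Tachycardic (HR 130) between minutes 300 and 480, else normal.
    return 130 if 300 <= minutes < 480 else 75

obs_times = [0, 240, 480, 720]          # 4-hourly observations
charted = [heart_rate(t) for t in obs_times]
print(charted)  # [75, 75, 75, 75] — the episode never reaches the record
```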

Laboratory values are interpreted in clinical context, but that context is in the text, not the number. A haemoglobin of 7.5 g/dL means something different in a patient with chronic kidney disease on long-term erythropoietin who has been stable for years than in a patient presenting acutely with melaena. The number is identical. The clinical meaning is entirely different. An AI that ingests lab values without the contextual information to interpret them is learning patterns that may not transfer reliably to the varied clinical situations where the values appear.

What Clinicians Should Ask

Evaluating an AI tool’s relationship to unstructured and missing data requires asking specific questions. These are not technical questions requiring specialist knowledge — they are the same questions a clinician applies to any clinical evidence.

How was missing data handled in this AI’s validation study? Common approaches include imputation (statistically estimating missing values from available data), complete-case analysis (including only patients with complete records), or simply ignoring missingness. Each approach has different implications. Complete-case analysis may create a validation sample that is systematically different from the real clinical population — precisely because complete records are not a random subset of records. Imputation assumes that missingness is explainable from available data, which may not hold when data is missing for the structured reasons described above.
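The difference between these approaches can be sketched with a toy validation set. The records below are hypothetical, with missingness deliberately skewed toward readmitted patients to mirror the structured missingness described above:

```python
import statistics

# Hypothetical validation records: (albumin in g/dL or None, readmitted).
# Records with missing albumin skew toward readmission — structured,
# not random, missingness (values are illustrative).
records = [
    (4.1, False), (3.9, False), (4.3, False), (3.8, False),
    (None, True), (2.9, True), (None, True), (3.0, False),
    (None, True), (4.0, False),
]

# Complete-case analysis: drop every record with a missing value.
complete = [(a, r) for a, r in records if a is not None]
cc_readmission = sum(r for _, r in complete) / len(complete)

# Full-cohort readmission rate, for comparison.
full_readmission = sum(r for _, r in records) / len(records)

# Mean imputation: fill gaps with the observed mean. This assumes the
# missing values resemble the observed ones — exactly the assumption
# that structured missingness violates.
mean_albumin = statistics.mean(a for a, _ in records if a is not None)
imputed = [(a if a is not None else mean_albumin, r) for a, r in records]

print(f"Readmission rate, complete cases: {cc_readmission:.0%}")
print(f"Readmission rate, full cohort:    {full_readmission:.0%}")
```

In this toy cohort, complete-case analysis reports a far lower readmission rate than the full cohort, because dropping incomplete records preferentially drops the readmitted patients.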

What percentage of the training data was unstructured text? If the AI was trained exclusively on structured fields, the clinician should understand that it is working from a minority of the available clinical information and ask whether the structured fields available adequately represent the clinical problem being addressed.

Was NLP used to extract information from clinical text, and how was that extraction validated? NLP performance varies significantly by text type and clinical domain. A system validated on radiology reports may not perform with similar accuracy on general clinical notes.

Is the AI’s performance reported separately for patients with sparser versus richer documentation? If a validation study reports only aggregate performance, it may obscure the asymmetry described above — strong performance on well-documented patients masking poor performance on sparsely documented ones.
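Stratified reporting is straightforward to do, which makes its absence from a validation study worth noticing. The results below are invented to show how an acceptable aggregate figure can hide a poor stratum:

```python
# Hypothetical validation results: (documentation_richness, model_correct).
results = [
    ("rich", True), ("rich", True), ("rich", True), ("rich", False),
    ("rich", True), ("rich", True),
    ("sparse", False), ("sparse", True), ("sparse", False),
    ("sparse", False),
]

def accuracy(stratum: str) -> float:
    """Accuracy within one documentation stratum."""
    hits = [ok for s, ok in results if s == stratum]
    return sum(hits) / len(hits)

overall = sum(ok for _, ok in results) / len(results)
print(f"Overall accuracy: {overall:.0%}")            # looks tolerable
for stratum in ("rich", "sparse"):
    print(f"{stratum:>6}: {accuracy(stratum):.0%}")  # reveals the gap
```

Here the aggregate figure sits between strong performance on well-documented patients and much weaker performance on sparsely documented ones, which is precisely the asymmetry an aggregate-only report conceals.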

The Instinct Every Clinician Already Has

There is nothing conceptually foreign in this analysis. Every experienced clinician has encountered the scenario where the data in the chart does not match the clinical reality in front of them. The patient whose chart looks reassuring but who clearly appears unwell. The blood test result that, on paper, looks normal but in context is concerning. The vital signs that are documented as stable but whose trajectory, known to the nursing team, tells a different story.

That instinct — the clinical judgment that says “this data is incomplete or misleading; the picture is different from what the chart shows” — is exactly the instinct that should be applied to AI. The chart is the data the AI has seen. The clinical reality may be richer, different, and more complex.

Understanding unstructured and missing data is not a technical exercise. It is an extension of clinical epistemology — the habit of asking, for any source of information, what it captures, what it misses, and how much weight it deserves in the clinical picture.

Applied to AI, that habit becomes: is the data this AI learned from consistent with the clinical reality of my patients? That question is within every clinician’s reach, and answering it is the beginning of genuinely critical AI use.


For an introduction to the full range of clinical data types and their AI applications, see Types of Clinical Data — and Why It Matters for AI. For a comprehensive introduction to clinical AI literacy, see Why Every Clinician Needs AI Literacy. For US standards on electronic health record data, the Office of the National Coordinator for Health Information Technology provides regulatory context.