A doctor, pressed for time after a long clinic, asks ChatGPT to summarise the current evidence on a drug dosing question in a specific patient population. Within seconds, the model returns a confident, well-structured answer. Three citations are provided, neatly formatted. The response reads like something from a good review article.

Two of the three citations do not exist. They are plausible-sounding journal references — correct journal names, plausible author surnames, credible publication years — but the papers were never written. The third citation is a real paper, but it says something meaningfully different from what the model attributed to it.

This is not a failure of the technology. This is the technology working exactly as designed. Understanding why that is the case — and what it implies for how clinicians should and should not use large language models — is the purpose of this article.

What an LLM Actually Does

A large language model, or LLM, is a type of AI trained on an enormous volume of text. ChatGPT, Gemini, Claude, and similar systems have read — in the sense of having processed as training data — a substantial fraction of publicly available text on the internet, along with books, academic papers, and other sources.

What the model learned from all this text is not facts. It learned patterns: the statistical regularities of language itself. Given a sequence of words, an LLM predicts which word is most likely to come next. Given the words “the patient presented with”, it calculates, from the clinical and medical text in its training data, which words typically follow. It generates text that is statistically consistent with the patterns it learned.
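To make the mechanism concrete, here is a minimal sketch of that prediction step in Python. The context, the candidate words, and their scores are all invented for illustration; a real model scores tens of thousands of possible tokens using billions of learned parameters, but the shape of the operation is the same.

```python
import math

# Toy illustration of next-token prediction. The "vocabulary" and the raw
# scores (logits) below are invented for this example.
context = "the patient presented with"
candidate_scores = {
    "chest": 3.1,
    "fever": 2.7,
    "acute": 2.2,
    "banana": -4.0,
}

# Softmax: convert raw scores into a probability distribution.
total = sum(math.exp(s) for s in candidate_scores.values())
probabilities = {tok: math.exp(s) / total for tok, s in candidate_scores.items()}

# The model then samples (or picks) a continuation from this distribution.
# Nothing in this step checks whether the continuation is factually true.
next_token = max(probabilities, key=probabilities.get)
print(f"'{context}' -> '{next_token}' (p = {probabilities[next_token]:.2f})")
```

The point of the sketch is the final comment: the selection step optimises for plausibility given the context, not for truth.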

This is a profound and genuinely impressive capability. It produces text that is fluent, contextually appropriate, often accurate, and capable of synthesising complex topics in accessible language. It is also, fundamentally, not the same as knowing something. The model does not retrieve facts from a database. It does not check what it generates against a verified source. It generates text that looks like what you would expect accurate text to look like on this topic — which is usually accurate, but is not guaranteed to be.

The analogy that most clinicians find useful: an LLM is like a very well-read colleague who can speak fluently and confidently about almost any topic — but who hasn’t necessarily verified what they’re saying. Most of what they say is accurate. Some of it is plausibly wrong. And they express both with exactly the same confidence.

This is not a bug being fixed in the next version. It is a structural feature of how language models work. The hallucination problem — the generation of fluent, confident, factually incorrect outputs — will be reduced as models improve, but it will not be eliminated, because the mechanism that produces fluent correct text is the same mechanism that produces fluent incorrect text. There is no internal truth-verification step.

What LLMs Are Genuinely Good At for Clinicians

This understanding of what LLMs actually do makes it possible to identify where they are and are not appropriate clinical tools. The key principle is straightforward: LLMs are most useful in contexts where a knowledgeable human reviews the output before it is acted on, and where the cost of an error in that output is low.

Drafting patient communication letters. An LLM can produce a first-draft letter explaining a diagnosis, outlining a treatment plan, or summarising a clinical discussion in plain language — adapted to the patient’s apparent health literacy level. The clinician reviews, edits, and approves. The risk of an LLM error in this context is low because the clinician, not the LLM, is the final authority on the content.

Summarising a clinical paper. Given the full text or abstract of a paper, an LLM can produce a readable summary of the study design, population, findings, and limitations. This is a genuinely time-saving use case. The important caveat: the summary should be verified against the original paper for any factual claim that will influence a clinical decision. Use the LLM to reduce reading time, not to replace reading.

Explaining a diagnosis in plain language. A clinician who has made a diagnosis and wants to explain it to a patient in accessible terms can ask an LLM to draft that explanation. Again, the clinician reviews and corrects. The output is a starting point, not a clinical statement.

Generating a differential diagnosis list for review. An LLM can rapidly generate a list of conditions consistent with a described presentation. This is most useful as a checklist function, prompting the clinician to consider conditions they might not otherwise have listed, rather than as a primary diagnostic tool. The clinician’s judgment drives the differential; the LLM surfaces possibilities for consideration.

Literature search orientation. An LLM is useful for identifying the key topics, terminology, and conceptual landscape of an unfamiliar clinical area — helping a clinician know what to search for in PubMed, what terms are used in the literature, and what the main research questions have been. It is not a substitute for actual literature searching in verified databases.

Structuring notes, abstracts, or presentations. LLMs are excellent at organising and structuring text according to a defined format. A clinician who provides their notes in prose and asks the LLM to structure them into a standard format (SOAP note, abstract, referral letter) can save time on formatting while retaining control over clinical content.

Translating clinical jargon. For interdisciplinary communication — explaining a finding from a specialist to a generalist, or summarising a complex situation for a patient or family — an LLM can produce accessible language quickly.

Figure: Where LLMs add genuine clinical value (drafting patient letters, summarising literature, explaining diagnoses in plain language, generating differential diagnosis lists for clinician review) and where they require extreme caution (drug dosing, verifying clinical facts, inputting patient data, acting on unverified citations).

What LLMs Cannot Be Trusted With

The same structural feature — fluent output without internal truth verification — makes LLMs genuinely dangerous in contexts where accuracy is safety-critical.

Drug dosing and drug interactions. An LLM may provide a drug dosing recommendation that is accurate for standard adult populations but incorrect for the specific patient population, renal function, or drug combination in question. It will present the incorrect answer with the same fluency as the correct one. Drug dosing questions should be answered from verified pharmacology resources — BNF, Micromedex, local formularies — not from language models.

Verifying clinical facts. If the question is “is this clinical fact true?”, an LLM is the wrong tool. It will answer confidently, and it may be right. But the verification must come from the clinician’s own knowledge or from a verified source, not from the model’s output. An LLM is not a fact-checking system.

Any decision where the stakes of being wrong are high. The general principle is: if acting on an incorrect LLM output would harm a patient, the output should not be acted on without independent verification from a reliable source.

Research citations. As illustrated in the opening of this article, LLMs hallucinate citations. This is well-documented and consistent. Any citation provided by an LLM must be independently verified before use. This is not paranoia — it is a documented characteristic of how these systems generate text.

Patient-identifiable data. See below.

The Hallucination Problem in Practical Terms

Clinicians who encounter the hallucination problem for the first time often look for explanations: the model was “confused,” the question was too specific, or they were using an older or less capable version of the system. None of these explanations is accurate.

Hallucination occurs because the model is generating statistically plausible text, not retrieving verified facts. When asked for a citation on a specific topic, the model generates author names, journal titles, and publication years that pattern-match the conventions of academic citation in that field. The result looks exactly like a real citation. It frequently is not.

Treat every factual claim from an LLM as a hypothesis to be verified, not a conclusion. This is not excessive caution — it is an accurate calibration of what the technology does.

This is genuinely different from how clinicians typically use reference tools. When a clinician looks up a drug interaction in a pharmacology database, they are retrieving a verified record. When they ask a language model the same question, they are receiving a generated response that may or may not match verified records. The user interface looks similar. The underlying process is entirely different.

The practical guidance is simple: for any factual claim from an LLM that you plan to act on, verify it from an independent source before acting.
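For citations in particular, part of that verification can be mechanised. The sketch below assumes the requests library and NCBI’s public E-utilities search endpoint; it checks whether a cited title returns any PubMed records at all. Zero hits strongly suggests a hallucinated reference, while a hit only confirms that a paper exists and still has to be read to confirm it says what the model claimed. The example title is illustrative.

```python
import requests

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def pubmed_hits(cited_title: str) -> int:
    """Return the number of PubMed records whose title matches the citation."""
    params = {
        "db": "pubmed",
        "term": f'"{cited_title}"[Title]',
        "retmode": "json",
    }
    response = requests.get(EUTILS, params=params, timeout=10)
    response.raise_for_status()
    return int(response.json()["esearchresult"]["count"])

# Titles are illustrative; in practice, paste in the citations the LLM produced.
for title in ["Vancomycin dosing in patients with impaired renal function"]:
    hits = pubmed_hits(title)
    # 0 hits: almost certainly hallucinated. >0 hits: still read the paper.
    print(f"{hits} PubMed hit(s): {title}")
```

A DOI or PMID lookup against Crossref or PubMed works on the same principle: existence and content are confirmed outside the model, never by asking the model itself.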

How to Write Better Prompts

The quality of an LLM’s output is substantially affected by the quality of the input. This is the core of “prompt thinking” — the practice of constructing inputs that produce more useful, more reliable outputs.

Specify role and context. A prompt that begins “You are helping a consultant physician prepare a brief for a multidisciplinary team meeting. The patient is a 65-year-old with newly diagnosed stage III non-small cell lung cancer” will produce more focused, more relevant output than a prompt that begins “tell me about lung cancer treatment.” The model uses everything in the prompt as context — the more relevant context provided, the more useful the output.

Ask for reasoning, not just conclusions. “Explain your reasoning step by step” or “Walk me through the considerations that inform this answer” produces output that is both more useful and easier to evaluate. When the model shows its reasoning, the clinician can identify where the logic is sound and where it is not. A conclusion without reasoning is harder to audit.

Request uncertainty flags. “Indicate where you are uncertain or where the evidence is limited or contested” is a prompt instruction that consistently improves output quality. Models that are prompted to acknowledge uncertainty tend to produce more calibrated outputs — identifying where the clinical evidence is clear versus where it is equivocal.

Use iterative refinement. The most useful LLM interactions are not single exchanges — they are dialogues. Start with a broad question, then refine with follow-up: “You mentioned X — can you elaborate on the evidence for that?” or “What are the main counterarguments to this position?” The model can be interrogated, challenged, and asked to reconsider in ways that surface gaps and errors.

A concrete illustration of the difference:

Weak prompt: “What’s the dose of vancomycin?”

Stronger prompt: “I am a hospital physician managing a patient with suspected MRSA bacteraemia. The patient is 70 years old, has CKD stage 3b with an eGFR of 35, and weighs 80 kg. What are the key considerations for vancomycin dosing in this specific population, and what monitoring would you recommend? Please indicate where local protocol or pharmacy consultation should be the primary guide.”

The stronger prompt does not eliminate the need to verify the answer — it never does. But it produces more contextually relevant output, is more likely to flag the need for dose adjustment and monitoring, and is more likely to indicate where expert consultation is needed rather than implying that the LLM’s answer is sufficient.
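For clinicians or informatics teams calling a model through an API rather than a chat window, the same techniques map directly onto the request structure: role and context in the system message, the specific question plus the reasoning and uncertainty instructions in the user message. A minimal sketch, assuming the OpenAI Python SDK; the model name is illustrative and no patient-identifiable details are included.

```python
from openai import OpenAI  # assumes the OpenAI Python SDK (openai>=1.0) is installed

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Role and context go in the system message; the clinical question, the
# reasoning request, and the uncertainty instruction go in the user message.
system_msg = (
    "You are assisting a hospital physician preparing for a clinical decision. "
    "Explain your reasoning step by step, flag where the evidence is limited "
    "or contested, and state explicitly where local protocol or pharmacy "
    "consultation should be the primary guide."
)
user_msg = (
    "Patient: 70 years old, CKD stage 3b (eGFR 35), 80 kg, suspected MRSA "
    "bacteraemia. What are the key considerations for vancomycin dosing in "
    "this population, and what monitoring would you recommend?"
)

response = client.chat.completions.create(
    model="gpt-4o",  # illustrative model name
    messages=[
        {"role": "system", "content": system_msg},
        {"role": "user", "content": user_msg},
    ],
)

# The reply is a draft to check against pharmacology references and local
# protocol, not a dosing instruction to act on.
print(response.choices[0].message.content)
```

The prompt structure, not the choice of model, is doing most of the work here; the output remains a draft to verify.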

Figure: Four prompting techniques that produce more reliable, more useful LLM outputs: specifying role and context, asking for step-by-step reasoning, requesting uncertainty flags, and using iterative follow-up questions.

Data Privacy and Patient Confidentiality

This point is non-negotiable and worth stating clearly: patient-identifiable information must never be entered into a public large language model.

When a clinician inputs text into ChatGPT, Gemini, or equivalent public consumer services, that text is transmitted to the provider’s servers. Depending on the service’s terms and privacy settings, it may be used in future model training. The privacy protections that apply to clinical data — HIPAA in the United States, GDPR in Europe, equivalent frameworks elsewhere — are not satisfied by the terms of service of consumer AI products.

This means that prompts containing patient names, dates of birth, identifying clinical details, or any combination of information that could identify an individual patient are inappropriate for public LLM tools.

Healthcare-specific deployments exist — Microsoft Copilot for Healthcare and similar institutional products — that operate under data processing agreements appropriate for clinical environments. These are substantively different from consumer tools. Clinicians using AI in institutional settings should be aware of which tools are approved for use with patient data and which are not.

When in doubt, the rule is simple: if you would not want the text in a patient’s record to appear in a training dataset or server log, do not put it in a public LLM prompt.
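Where an institution does permit programmatic LLM use, a crude screening step is sometimes added before any text leaves the organisation. The sketch below is illustrative only: a handful of regular expressions for obvious identifiers. Free-text names, addresses, and rare identifying details are not reliably caught this way, so treat it as a tripwire, not a substitute for de-identification tooling, a data processing agreement, or local information-governance approval.

```python
import re

# Illustrative patterns for obvious identifiers. Real de-identification is
# much harder than this; this check is a tripwire, not a compliance control.
IDENTIFIER_PATTERNS = {
    "NHS-style number": re.compile(r"\b\d{3}[ -]?\d{3}[ -]?\d{4}\b"),
    "date of birth": re.compile(r"\b\d{1,2}[/.-]\d{1,2}[/.-](19|20)\d{2}\b"),
    "email address": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
}

def refuse_if_identifiable(prompt: str) -> str:
    """Raise an error if the prompt contains an obvious patient identifier."""
    found = [label for label, pattern in IDENTIFIER_PATTERNS.items()
             if pattern.search(prompt)]
    if found:
        raise ValueError("Possible patient identifiers detected: " + ", ".join(found))
    return prompt

try:
    refuse_if_identifiable(
        "Mrs Smith, DOB 12/04/1953, NHS number 943 476 5919, admitted with chest pain"
    )
except ValueError as exc:
    print(exc)  # the prompt is blocked before anything reaches an external service
```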

What This Means for Clinical Practice

LLMs are, right now, a genuine tool in many clinical workflows. The AMA has engaged substantively with the question of AI in medicine, and the clinical community increasingly treats fluency with LLMs as an emerging professional competency rather than an optional curiosity.

The clinicians who use LLMs most effectively are not those who use them most or least — they are those who understand what they are. They treat LLM outputs as drafts requiring review, not conclusions requiring endorsement. They verify factual claims before acting on them. They apply strong prompting practices that produce more useful and more auditable outputs. They never input patient data into unsecured tools.

Used with this clarity, LLMs genuinely reduce administrative burden, accelerate literature engagement, and support communication tasks that consume time but do not require the clinical judgment that only the clinician can supply. That is a real and meaningful contribution to clinical work.

Used with appropriate scepticism — treating outputs as drafts to verify, not conclusions to act on — LLMs are useful. Used credulously, they represent a new category of clinical risk. The difference is knowing what they actually are.

The clinician who understands the distinction is better positioned than one who either refuses LLMs entirely — forgoing genuine efficiency gains — or adopts them uncritically. The position to take is neither avoidance nor deference. It is informed, sceptical, professional use.


For a broader introduction to AI literacy in clinical practice, see Why Every Clinician Needs AI Literacy. For a foundation in interpreting AI performance metrics, see Biostatistics to AI.