While deep learning holds immense potential for improving healthcare delivery, it also introduces new risks that must be carefully understood and addressed. As the previous sections have highlighted, these tools can augment clinical decision-making, reduce administrative burden, and improve diagnostic accuracy. However, they also carry limitations that are not always well understood, even as deployment accelerates across healthcare settings. To maximize the benefits of AI while minimizing harm, it is critical to examine the underlying risks posed by current technologies. This section outlines several of the most broadly applicable limitations.
Bias in Training Data
One of the most pressing concerns in the use of deep learning is its susceptibility to bias.33,34 Because these models learn from historical data, any patterns present in the training dataset, including biased or spurious ones, can be encoded and even amplified in the model’s outputs. To understand how this distortion occurs, we can return to the toddler playing with the shape-sorter toy. Now imagine that the blocks are all different colors: a red sphere, a green pyramid, and so on. Rather than learning to sort blocks by cross-sectional shape (e.g., the spherical block goes into the circle-shaped hole), the child may instead mistakenly learn to sort by color (e.g., the red block goes into the circle-shaped hole). If presented with a new toy set with different color-shape combinations, the child might err by trying to force a red block into the circle-shaped hole, relying on an association that was never correct to begin with.
This type of mislearning has clear parallels in healthcare. For instance, if a deep learning model is trained to predict sepsis risk using historical hospital data, it might latch onto contextual clues—like a patient’s hospital unit—instead of physiologic indicators like blood pressure or white blood cell count. If most sepsis cases in the training set came from the ICU, the model may associate ICU admission with high sepsis risk and underestimate the risk for patients in general wards, even when clinical symptoms are present. Like the toddler misclassifying the red block, the model has learned the wrong rule.
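To make this failure mode concrete, consider the following minimal sketch (hypothetical Python using synthetic data and scikit-learn; the feature names and cohort sizes are invented for illustration). A simple classifier is trained on a cohort in which an ICU flag happens to track the sepsis label; the model leans on that shortcut and then fails once the flag no longer carries signal, much as the toddler’s color rule fails on a new toy set.

```python
# A minimal sketch of "shortcut learning" on synthetic data (hypothetical
# example): an ICU flag correlates with sepsis in training but not at test time.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 5000

def make_cohort(icu_tracks_sepsis):
    sepsis = rng.integers(0, 2, n)                    # 0 = no sepsis, 1 = sepsis
    wbc = rng.normal(8 + 6 * sepsis, 4, n)            # noisy physiologic signal
    if icu_tracks_sepsis:
        icu = (sepsis == 1) | (rng.random(n) < 0.05)  # ICU flag mirrors the label
    else:
        icu = rng.random(n) < 0.5                     # ICU flag is uninformative
    X = np.column_stack([wbc, icu.astype(float)])
    return X, sepsis

X_train, y_train = make_cohort(icu_tracks_sepsis=True)
X_test, y_test = make_cohort(icu_tracks_sepsis=False)

model = LogisticRegression().fit(X_train, y_train)
print("train accuracy:", model.score(X_train, y_train))  # looks excellent
print("test accuracy: ", model.score(X_test, y_test))    # drops sharply
print("coefficients [wbc, icu]:", model.coef_[0])        # ICU weight dominates
```

The numbers themselves are not the point; the pattern is. The model achieves high training accuracy by weighting a feature that was never causally related to the outcome, and that weakness only surfaces when the data distribution shifts.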
Recent evaluations have shown that this kind of biased learning is not just a theoretical risk but a documented problem for deep learning models. As deep learning models, and LLMs in particular, are trained on enormous unstructured datasets—OpenAI, for example, reportedly “exhausted every reservoir of reputable English-language text on the internet” when training GPT-4 and turned to transcribing YouTube videos to further expand its dataset35—they are susceptible to encoding and perpetuating existing societal and healthcare biases. For instance, GPT-4 systematically generated racially stereotyped clinical vignettes and produced different diagnostic and treatment recommendations based solely on a patient’s race or sex.36 Another analysis showed that LLMs tend to associate higher healthcare costs, longer hospitalizations, and overly optimistic prognoses with specific racial groups, reflecting and reinforcing real-world disparities.37 Even when directly tested with clinical vignettes designed to minimize bias, LLMs nonetheless varied their diagnostic and treatment recommendations based on race, ethnicity, sex, and socioeconomic status.38 Moreover, several commercial models were found to perpetuate outdated, scientifically refuted race-based medical misconceptions, such as differences in pain tolerance and kidney function.39 These findings highlight the growing risk that, without careful oversight, deep learning models may not only replicate historical biases embedded in their training data but also amplify inequities when integrated into clinical practice.
Hallucinations and Errors
Another critical limitation (particularly for LLMs) is the risk of hallucinations, in which the model generates outputs that are factually incorrect or misleading. These errors are often presented in a confident and convincing tone, making them difficult to detect without careful human review. A widely publicized example occurred in 2023, when a lawyer submitted a legal brief drafted using ChatGPT that cited six fictitious court cases formatted to appear legitimate.40 The citations were entirely fabricated yet plausible enough to go unnoticed until formally challenged.
Hallucinations are not simply the result of poor training; they are a structural consequence of how LLMs are designed. These models are optimized to predict the next most likely word or phrase, not to verify facts.41,42 When confronted with unfamiliar or ambiguous prompts, they may generate content that sounds authoritative but is untrue. In the case of the legal brief, the fabricated citations reflected the model’s fluency in legal language, not actual legal knowledge.
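A toy bigram generator (a hypothetical illustration, orders of magnitude simpler than a real LLM) makes the mechanism visible: generation simply emits whatever continuation was most frequent in the training text, with no representation of whether the result is true.

```python
# A toy bigram "language model" (hypothetical corpus): generation picks the
# statistically most likely next word, with no notion of factual accuracy.
from collections import Counter, defaultdict

corpus = ("the court held that the statute applied . "
          "the court held that the claim failed .").split()

bigrams = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigrams[prev][nxt] += 1

def generate(word, length=8):
    out = [word]
    for _ in range(length):
        if word not in bigrams:
            break
        word = bigrams[word].most_common(1)[0][0]  # most frequent, not most true
        out.append(word)
    return " ".join(out)

print(generate("the"))  # fluent, legal-sounding text assembled purely from statistics
```

Real LLMs operate on the same predictive principle at vastly greater scale, which is why their output can read as authoritative legal or clinical prose even when the underlying facts are fabricated.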
In healthcare, hallucinations pose an even greater risk. Evaluations of AI-generated clinical content have uncovered concerning error rates. Ambient digital scribes have been found to introduce errors in 70 percent of generated clinical notes.43 AI-generated responses to patient messages have shown a hallucination rate of 6 percent,31 with 7.1 percent of drafts posing a severe risk of harm.44 In diagnostic applications, performance varies by task and context: a systematic review of over 500 deep learning studies in radiology reported a median diagnostic accuracy of 89.4 percent, though many of the included studies lacked external validation or were at high risk of bias.45 These findings highlight a key challenge: AI systems can appear competent and trustworthy while generating dangerously inaccurate content.
Lack of Transparency
Compounding these concerns is the inherent lack of transparency in how deep learning models operate. As noted earlier, these systems are often described as “black boxes” because their decision-making processes are not easily understood. A model may flag a CT scan as indicative of cancer or estimate a high sepsis risk yet provide no clear rationale for how it reached that conclusion. Returning to the earlier shape-sorter toy analogy: if you ask the toddler why the sphere fits in the circular hole, they might simply reply, “I just know.” Similarly, deep learning models make predictions based on patterns distributed across many layers of numerical weights, not through a process that is logically structured or interpretable in human terms.
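The following sketch (hypothetical, using scikit-learn on synthetic data) shows what that opacity looks like in practice: even for a tiny network, the only “explanation” on offer is a stack of weight matrices, which carry no human-readable rationale for any individual prediction.

```python
# A minimal sketch of the "black box" problem (hypothetical example): the
# model produces an answer, but its learned parameters are just numbers.
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))            # four anonymous input features
y = (X[:, 0] * X[:, 2] > 0).astype(int)  # hidden rule the model must infer

model = MLPClassifier(hidden_layer_sizes=(8, 8), max_iter=2000,
                      random_state=0).fit(X, y)
print("prediction:", model.predict(X[:1]))  # an answer, but no explanation
for i, layer in enumerate(model.coefs_):
    print(f"layer {i} weights:\n", np.round(layer, 2))  # the entire "rationale"
```

Inspecting those matrices tells a reader nothing like “the sphere fits because its cross-section is a circle”; the decision is distributed across dozens of coupled parameters, and production models have millions or billions of them.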
This opacity presents a fundamental challenge in clinical environments, where transparency, accountability, and justification are central to safe and ethical care. While the risks of bias and hallucinations underscore the need for human oversight, the lack of transparency complicates that oversight. Clinicians may be inclined to over-trust the polished and plausible outputs of AI tools, especially when those outputs are delivered with apparent confidence but without explanation, and this over-trust can introduce new risks into clinical decision making. On the other hand, if clinicians lack trust in AI tools altogether, the systems may go underutilized, falling short of their potential to improve care delivery.
For these reasons, simply keeping humans “in the loop” is not enough. It is equally important to understand how human judgment is shaped by AI outputs—both positively and negatively. The following section explores key aspects of human psychology and cognitive bias that can influence a clinician’s ability to evaluate, trust, and correct AI-generated content, and how this may impact the safe and effective adoption of these technologies.
