Stop Chasing AI Perfection in Healthcare—Build Trust by Matching Human Performance

Artificial intelligence (AI) is often expected to outperform humans in healthcare because the stakes are high. A more rational approach is to evaluate AI against the real‑world performance of clinicians, not an impossible ideal of zero mistakes. Every healthcare process – triage, diagnostics, discharge planning, billing – relies on human judgment that is inherently variable and imperfect. By making these baseline errors visible and comparing AI against them, healthcare leaders can adopt AI as a decision‑support assistant rather than a replacement for clinicians. This evidence‑driven perspective reframes AI adoption from hype or fear into a trust‑building process.

Why the Human Baseline Matters

Human error is pervasive but rarely measured

  • Radiology: Retrospective studies show that 20–30 % of lung cancers visible on chest radiographs are missed at initial interpretation[1]. These misses often occur because pulmonary nodules overlap normal anatomy; about 90 % of diagnostic errors in lung cancer occur on chest x‑rays[1]. More recent radiology literature estimates that 19–26 % of lung cancers may be missed on chest x‑ray[2].
  • Colonoscopy: Even experienced endoscopists miss adenomas. Reviews report that up to 25 % of adenomas can be missed[3], and a meta‑analysis of white‑light colonoscopy estimated that about one‑third (34 %) of adenomas are missed[4]. Flat or small lesions are frequently overlooked, and one study noted that, when missed, 20–30 % of large adenomas may progress to cancer[5].
  • Triage: In a retrospective study of >96 000 emergency department encounters, nurses initially misclassified triage levels in about 17 % of cases. Misclassification was associated with older age and vital‑sign abnormalities, and under‑triaged patients had higher rates of admission and critical illness[6].
  • Complex workflows: For diabetic eye disease screening, referral‑based programs achieve completion rates of only 35–72 % because of extra visits and transportation barriers[7]. Stroke care is also time‑sensitive; every 15 minutes of treatment delay reduces the chance of a good outcome, yet coordination delays are common[8].

These observations show that human performance is variable and fallible. Importantly, such baseline errors are often invisible because routine clinical practice rarely measures miss rates or misclassification. Establishing human baselines allows AI to be fairly judged by asking whether it provides incremental improvements over these known limitations.
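
To make that comparison concrete, here is a minimal sketch (Python, standard library only) of the audit arithmetic involved: a two‑proportion z‑test of a hypothetical AI‑assisted miss rate against a locally measured human baseline. All counts below are invented for illustration, not taken from any cited study.

```python
# Minimal sketch: test a hypothetical AI-assisted miss rate against a
# measured human baseline with a two-proportion z-test (stdlib only).
# All counts below are invented for illustration, not from any cited study.
from math import erf, sqrt

def two_proportion_z(misses_a: int, n_a: int, misses_b: int, n_b: int):
    """Return (z, two-sided p) for H0: both groups share one miss rate."""
    p_a, p_b = misses_a / n_a, misses_b / n_b
    pooled = (misses_a + misses_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    p = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))  # normal approximation
    return z, p

# Hypothetical audit: clinicians miss 60 of 240 lesions (25%),
# the AI-assisted arm misses 30 of 250 (12%).
z, p = two_proportion_z(60, 240, 30, 250)
print(f"z = {z:.2f}, p = {p:.4f}")  # -> z = 3.72, p = 0.0002
```

The same test works in either direction: it tells leaders whether a tool clears, matches, or falls short of the human bar.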

Trust implications

  • AI does not need to outperform humans outright to be valuable. Matching human detection while reducing workload or delays already adds measurable value.
  • Transparency in trade‑offs is key: AI may increase false positives, but if it dramatically reduces false negatives or workload, the net gain may justify its use.
  • Both AI and human performance must be continuously measured and audited. Comparing AI to human baselines—not perfection—provides a more rational benchmark for adoption.

Evidence From Major Clinical Domains

Mammography Screening: AI Reduces Workload Without Compromising Detection

  • The MASAI trial (Lancet Oncology 2023) compared AI‑supported mammography screening with standard double reading. AI‑supported screening achieved a cancer detection rate at least as high as double reading, with similar recall rates, while reducing screen‑reading workload by 44.3 %[9]. The trial's 2025 accuracy analysis confirmed that AI increased cancer detection (6.4 vs 5.0 cancers per 1 000 participants) and positive predictive value without increasing false positives, while maintaining a 44.2 % reduction in workload[10].
  • Trust implication: By matching human detection and reducing workload, AI functions as a reliable second reader that allows radiologists to spend time on complex cases. It does not need to be perfect; it needs to be better than or equal to current practice. The sketch below works through the reading‑workload arithmetic.
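
As a rough illustration of where such savings come from, the sketch below models an AI‑triage protocol in which low‑risk exams receive a single human read and high‑risk exams keep double reading. The 80/20 split is an assumption for illustration, not a figure from the trial.

```python
# Back-of-envelope model of AI-triaged screen reading: low-risk exams get
# one human read, high-risk exams keep two. The 20% high-risk fraction is
# an illustrative assumption, not a MASAI parameter.
def workload_reduction(n_exams: int, high_risk_fraction: float) -> float:
    double_reading = 2 * n_exams                        # standard of care
    ai_supported = (n_exams * (1 - high_risk_fraction)  # single read
                    + 2 * n_exams * high_risk_fraction) # double read
    return 1 - ai_supported / double_reading

print(f"{workload_reduction(40_000, 0.20):.1%}")  # -> 40.0%
```

Under this simple model the reduction is (1 − f) / 2 for a double‑read fraction f, so a reported ~44 % saving corresponds to roughly 11 % of exams retaining double reading.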

Colonoscopy: AI Catches What Humans Miss

  • Human baseline: Standard colonoscopy misses roughly 25 % of adenomas[3], and white‑light colonoscopy may miss one‑third (34 %) of adenomas[4]. These missed lesions can progress to cancer[5].
  • AI‑assisted performance: A 2024 meta‑analysis of 28 randomised controlled trials (23 861 participants) found that AI‑assisted colonoscopy increased adenoma detection rates by 20 % and reduced adenoma miss rates by 55 %. While withdrawal times increased slightly (about 0.15 minutes) and non‑neoplastic polyp removal increased, the overall benefit was consistent across endoscopist experience levels[11].
  • Trade‑off: The improved detection comes at the cost of longer procedures and potentially more false positives. However, catching additional adenomas likely prevents cancers and justifies the slight increase in procedure time; the short sketch below makes the miss‑rate arithmetic concrete.
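
Applying the meta‑analysis's 55 % relative reduction to the human baselines quoted above is pure arithmetic on the published summary figures, but it shows what the headline numbers mean in practice:

```python
# Apply the meta-analysis's 55% relative reduction in adenoma miss rate
# to the human baselines cited above. Pure arithmetic on published
# summary figures; no patient-level data.
baselines = {"standard colonoscopy": 0.25, "white-light colonoscopy": 0.34}
relative_reduction = 0.55  # from the 2024 meta-analysis [11]

for name, human_miss in baselines.items():
    ai_miss = human_miss * (1 - relative_reduction)
    print(f"{name}: {human_miss:.0%} missed -> ~{ai_miss:.0%} with AI")
# -> standard colonoscopy: 25% missed -> ~11% with AI
# -> white-light colonoscopy: 34% missed -> ~15% with AI
```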

Sepsis: Lessons From COMPOSER and Epic Sepsis Models

  • Human baseline: Sepsis recognition is challenging; usual care often detects sepsis late, leading to inconsistent bundle compliance and high mortality.
  • AI in practice: The COMPOSER model (npj Digital Medicine 2024) was deployed in two emergency departments. Compared with usual care, it led to a 1.9 % absolute (17 % relative) reduction in in‑hospital sepsis mortality, a 5 % absolute increase in sepsis bundle compliance, and a 4 % reduction in SOFA score progression[12]. Counterfactual analysis suggested that mortality would have been 11.39 % without COMPOSER, versus the 9.49 % observed with the model in place[13].
  • External validation matters: Not all sepsis AI tools work as advertised. An independent validation of the Epic Sepsis Model (ESM) involving 27 697 patients found a hospitalisation‑level AUC of 0.63 (95 % CI 0.62–0.64), with 33 % sensitivity, 83 % specificity, and a positive predictive value of 12 %[14]. The model flagged only 7 % of the sepsis patients whom clinicians had missed (those without timely antibiotics), failed to detect 67 % of sepsis cases, and generated alerts for 18 % of all hospitalisations[15]. In other words, it produced a high alert burden with limited benefit.
  • Trust implication: AI can improve sepsis outcomes when transparently developed and validated (e.g., COMPOSER). However, proprietary models may underperform in real‑world settings, highlighting the need for local benchmarking against human baselines. The sketch below shows why a modest operating point at low prevalence produces exactly this alert‑burden problem.
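
The ESM result has a simple probabilistic core: with modest sensitivity and specificity at low disease prevalence, most alerts are false. The sketch below derives alert burden and PPV from the published operating point via Bayes' rule; the ~7 % sepsis prevalence is an assumed illustrative figure, chosen because it approximately reproduces the reported 12 % PPV and 18 % alert burden.

```python
# Derive alert burden and PPV from sensitivity, specificity, and prevalence
# (Bayes' rule). Sensitivity/specificity are the published ESM figures
# [14][15]; the 7% prevalence is an assumption for illustration.
def alert_stats(sens: float, spec: float, prev: float):
    true_alerts = sens * prev               # P(alert and sepsis)
    false_alerts = (1 - spec) * (1 - prev)  # P(alert and no sepsis)
    burden = true_alerts + false_alerts     # fraction of patients alerted
    return burden, true_alerts / burden     # (alert burden, PPV)

burden, ppv = alert_stats(sens=0.33, spec=0.83, prev=0.07)
print(f"alert burden ~{burden:.0%}, PPV ~{ppv:.0%}")  # -> ~18%, ~13%
```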

Diabetic Retinopathy Screening: AI Closes Workflow Gaps

  • Human baseline: Youth with diabetes often do not complete annual eye exams; referral‑based programs have completion rates of 35–72 %[7].
  • AI‑assisted performance: The ACCESS trial (Nature Communications 2024) randomised youth with type 1 and type 2 diabetes to an autonomous AI diabetic eye exam at the point of care versus usual referral. The AI group achieved 100 % completion of diabetic eye exams (95 % CI 95.5–100 %) compared with 22 % in the referral group. Among those with abnormal AI results, 64 % completed follow‑through with an eye‑care provider versus 22 % in the control group[16].
  • Trust implication: This trial shows that AI can address workflow bottlenecks and social determinants (transportation, extra visits) rather than simply matching diagnostic performance. By delivering immediate results at the point of care, AI increases adherence and equity. The sketch below shows where the quoted confidence bounds on the 100 % completion rate come from.
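
One statistical footnote on the 100 % completion figure: a perfect observed rate in a finite trial arm still carries uncertainty, which is where the quoted 95.5 % lower bound comes from. The sketch below reproduces it with an exact (Clopper‑Pearson) binomial bound; the arm size of 80 is inferred for illustration, not taken from the paper.

```python
# Exact (Clopper-Pearson) lower confidence bound when every participant
# succeeds: solve P(X = n | p) = alpha/2 for p. The n = 80 arm size is an
# inferred, illustrative value, not a figure quoted from the ACCESS paper.
n, alpha = 80, 0.05
lower = (alpha / 2) ** (1 / n)  # closed form when successes == n
print(f"100% completion, n={n}: 95% CI lower bound ~{lower:.1%}")  # -> ~95.5%
```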

Stroke Coordination: AI Reduces Delays

  • Human baseline: Timely treatment is critical for stroke; every 15 minutes of delay reduces the chance of a good outcome[8]. Human coordination often suffers from delays in notifying neurointerventionalists.
  • AI‑enabled coordination: The VALIDATE study analysed 14 116 stroke alerts across 166 hospitals using the Viz.ai mobile platform. AI‑enabled hospitals had a 39.5‑minute faster median time from patient arrival to neurointerventionalist contact compared with non‑AI hospitals – a 44 % reduction in delay[17]. After adjusting for pre‑alert biases, AI still reduced contact time by 30.5 minutes (31.8 %)[18]. The platform also reduced variability in workflow without significantly affecting thrombolysis rates[8].
  • Trust implication: AI contributes by speeding communication and coordination, a known weakness of human systems. It does not replace physicians but augments the stroke team’s efficiency. The short arithmetic check below reconstructs the medians implied by the reported reductions.
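
As a quick consistency check on those figures, the implied median arrival‑to‑contact times can be recovered from the reported absolute and relative reductions (arithmetic only; the study reports the medians directly):

```python
# Recover the implied median arrival-to-contact times from the reported
# absolute savings and relative reductions. Arithmetic check only.
for saved_min, rel_reduction, label in [(39.5, 0.44, "unadjusted"),
                                        (30.5, 0.318, "bias-adjusted")]:
    baseline = saved_min / rel_reduction  # non-AI median, minutes
    with_ai = baseline - saved_min        # AI-enabled median, minutes
    print(f"{label}: ~{baseline:.0f} min -> ~{with_ai:.0f} min with AI")
# -> unadjusted: ~90 min -> ~50 min with AI
# -> bias-adjusted: ~96 min -> ~65 min with AI
```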

Ethical and Operational Lessons

  1. Incremental gains, not perfection: AI should be judged by whether it improves clinical outcomes relative to current practice. Even modest improvements—such as a 20 % increase in adenoma detection or a 44 % reduction in stroke coordination time—can translate into lives saved.
  2. Transparent trade‑offs: AI may increase false positives or procedure times. These trade‑offs must be openly communicated and weighed against the benefits. For example, AI colonoscopy slightly lengthens withdrawal time but halves miss rates[11].
  3. Continuous measurement and shared accountability: Both human and AI systems should be audited. The poor real‑world performance of the ESM underscores the necessity of independent validation[14]. Hospitals should measure their clinicians’ baseline performance (e.g., triage misclassification, missed cancers) and then compare AI tools against those baselines.
  4. Equity in baselines: Human performance varies across populations and settings. AI tools must be tested and calibrated across diverse demographics and hospital types. For instance, the ACCESS trial used an AI system validated on multiethnic populations and improved screening in under‑served youth[16][7]. AI adoption should avoid reinforcing disparities by ensuring algorithms are trained on representative data.

Building a Culture of Trust

To encourage responsible AI adoption, healthcare leaders should:

  • Measure human performance first: Establish clear metrics for missed diagnoses, misclassification, and delays. Making these errors visible provides a baseline for improvement.
  • Benchmark AI against humans, not perfection: Evaluate whether AI adds value relative to current practice and whether it reduces workload, increases detection, or speeds care.
  • Communicate trade‑offs openly: Discuss potential increases in false positives or procedure time and involve clinicians and patients in decision‑making.
  • Monitor both AI and human performance over time: Use quality‑improvement processes to refine algorithms and clinician behaviour. Shared accountability ensures that AI remains an assistant rather than an unmonitored black box; a minimal audit sketch follows this list.
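
Here is a minimal sketch of what such shared monitoring could look like in practice, assuming a common metric (a miss rate) tracked monthly for both clinicians and the AI tool. The tolerance, baselines, and data are all invented for illustration.

```python
# Toy audit loop: track one shared metric (miss rate) for clinicians and
# the AI tool, and flag drift beyond an agreed tolerance. All numbers are
# invented for illustration.
TOLERANCE = 0.05  # allowed absolute worsening vs. the agreed baseline

def audit(monthly: dict[str, list[float]], baseline: dict[str, float]) -> None:
    for actor, rates in monthly.items():
        latest = rates[-1]
        if latest - baseline[actor] > TOLERANCE:
            print(f"ALERT: {actor} at {latest:.0%}, baseline {baseline[actor]:.0%}")
        else:
            print(f"OK: {actor} at {latest:.0%}")

audit(monthly={"clinicians": [0.24, 0.25, 0.26], "ai_tool": [0.11, 0.12, 0.19]},
      baseline={"clinicians": 0.25, "ai_tool": 0.12})
# -> OK: clinicians at 26%
# -> ALERT: ai_tool at 19%, baseline 12%
```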

AI is not flawless—but neither are humans. By anchoring evaluation in the Human Baseline, healthcare systems can shift adoption from hype or fear to rational trust. AI should be judged not by whether it is perfect, but by whether it makes care better, safer, and fairer than the current reality. Recognising where humans currently stand does not lower the bar; it reveals where the bar truly is.


[1] Commonly Missed Findings on Chest Radiographs: Causes and Consequences – PMC

https://pmc.ncbi.nlm.nih.gov/articles/PMC10154905

[2] Bone suppression detects more lung nodules on chest x-rays | AuntMinnie

https://www.auntminnie.com/clinical-news/digital-x-ray/article/15614508/bone-suppression-detects-more-lung-nodules-on-chest-x-rays

[3] Combining techniques and technologies increases adenoma detection rates in colonoscopy: More is more – PMC

https://pmc.ncbi.nlm.nih.gov/articles/PMC12362565

[4] [5] One in three adenomas could be missed by white-light colonoscopy – findings from a systematic review and meta-analysis – PMC

https://pmc.ncbi.nlm.nih.gov/articles/PMC11908064

[6] Accuracy of emergency department triage using the Emergency Severity Index and independent predictors of under-triage and over-triage in Brazil: a retrospective cohort analysis | International Journal of Emergency Medicine | Full Text

https://intjem.biomedcentral.com/articles/10.1186/s12245-017-0161-8

[7] [16] Autonomous artificial intelligence increases screening and follow-up for diabetic retinopathy in youth: the ACCESS randomized control trial | Nature Communications

https://www.nature.com/articles/s41467-023-44676-z

[8] [17] [18] Frontiers | VALIDATE—Utilization of the Viz.ai mobile stroke care coordination platform to limit delays in LVO stroke diagnosis and endovascular treatment

https://www.frontiersin.org/journals/stroke/articles/10.3389/fstro.2024.1381930/full

[9] Artificial intelligence-supported screen reading versus standard double reading in the Mammography Screening with Artificial Intelligence trial (MASAI): a clinical safety analysis of a randomised, controlled, non-inferiority, single-blinded, screening accuracy study – PubMed

https://pubmed.ncbi.nlm.nih.gov/37541274

[10] Screening performance and characteristics of breast cancer detected in the Mammography Screening with Artificial Intelligence trial (MASAI): a randomised, controlled, parallel-group, non-inferiority, single-blinded, screening accuracy study – PubMed

https://pubmed.ncbi.nlm.nih.gov/39904652

[11] Use of artificial intelligence improves colonoscopy performance in adenoma detection: a systematic review and meta-analysis – PubMed

https://pubmed.ncbi.nlm.nih.gov/39216648

[12] [13] Impact of a deep learning sepsis prediction model on quality of care and survival | npj Digital Medicine

https://www.nature.com/articles/s41746-023-00986-6

[14] [15] External Validation of a Widely Implemented Proprietary Sepsis Prediction Model in Hospitalized Patients – PMC

https://pmc.ncbi.nlm.nih.gov/articles/PMC8218233