What do AI model scores actually mean in clinical settings?

AI & Data Products, Healthcare
March 29, 2026
Sudhir Shandilya

AI in Healthcare · Performance Metrics

Rare diseases expose the hardest truth about AI model scores. When only 3 patients in 100 have the condition, a model that predicts negative for everyone is 97% accurate — and clinically useless. The metrics that matter in rare disease AI are not the ones most people report first.
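The arithmetic behind that 97% figure can be checked in a few lines. This is a minimal sketch using the 3-in-100 cohort from the paragraph above; the variable names are illustrative and not taken from any particular tool.

```python
# Cohort from the text: 100 patients, 3 with the condition (1 = has it, 0 = does not).
labels = [1] * 3 + [0] * 97

# A "model" that predicts negative for every patient.
predictions = [0] * len(labels)

correct = sum(1 for y, p in zip(labels, predictions) if y == p)
true_positives = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 1)

accuracy = correct / len(labels)       # share of all patients labelled correctly
recall = true_positives / sum(labels)  # share of real cases that were caught

print(f"accuracy = {accuracy:.2f}")  # 0.97
print(f"recall   = {recall:.2f}")    # 0.00
```

Accuracy rewards the 97 correct negatives and hides the fact that none of the 3 real cases was found.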

The four metrics, plainly stated

AUC (Area Under the Curve) measures how well the model ranks patients by risk. An AUC of 0.87 means: if you randomly pick one patient who has the condition and one who does not, the model gives the higher risk score to the patient with the condition 87 times out of 100. It does not tell you how many patients to flag, and it does not tell you whether the model is actually useful in a rare disease context.
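The pairwise reading of AUC can be computed directly, which makes it clear that no threshold is involved. The sketch below assumes a handful of made-up risk scores; `pairwise_auc` is a hypothetical helper name, and ties are counted as half a win, which is the usual convention.

```python
from itertools import product

def pairwise_auc(pos_scores, neg_scores):
    """Fraction of (positive, negative) pairs in which the positive patient
    has the higher score; ties count as half a win."""
    wins = 0.0
    for p, n in product(pos_scores, neg_scores):
        if p > n:
            wins += 1.0
        elif p == n:
            wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

# Hypothetical risk scores, invented for illustration only.
with_condition = [0.9, 0.6, 0.4]
without_condition = [0.7, 0.5, 0.3, 0.2, 0.1]

auc = pairwise_auc(with_condition, without_condition)
print(f"AUC = {auc:.2f}")  # 12 of 15 pairs ranked correctly -> 0.80
```

Nothing in the calculation says which patients get flagged. That decision only appears once a threshold is chosen.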

Recall (Sensitivity) is the percentage of true cases the model catches. In rare disease, this is the metric that determines whether patients escape a 5–10 year diagnostic odyssey or are identified early. High recall means fewer missed diagnoses.

Precision (Positive Predictive Value) is the percentage of flagged patients who actually have the condition. In rare disease, where prevalence is low, precision is almost always the harder metric: because real cases are scarce, even a model with a low false-positive rate can raise as many false alarms as true detections, or more.
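The dependence of precision on prevalence follows from Bayes' rule. The sketch below holds a hypothetical model fixed at 90% sensitivity and 95% specificity (values assumed for illustration, not taken from this article) and changes only the prevalence.

```python
def ppv(sensitivity, specificity, prevalence):
    """Precision (PPV) from Bayes' rule for a given population prevalence."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)

# Hypothetical operating point, chosen only for illustration.
SENSITIVITY = 0.90
SPECIFICITY = 0.95

for prevalence in (0.50, 0.10, 0.03, 0.01):
    value = ppv(SENSITIVITY, SPECIFICITY, prevalence)
    print(f"prevalence {prevalence:.0%}: precision = {value:.2f}")
```

Under these assumptions the same model goes from a precision of roughly 0.95 at 50% prevalence to roughly 0.36 at 3% prevalence, without any change to the model itself.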

F1-score is the harmonic mean of recall and precision: F1 = 2 × (Precision × Recall) / (Precision + Recall). In rare disease, F1 is more honest than accuracy. It cannot be gamed by a model that simply predicts negative for everyone. Both recall and precision must be strong for F1 to be strong.
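A short sketch shows why the harmonic mean is harder to game than a simple average. It compares two degenerate strategies in a cohort with 3% prevalence; the dictionaries and the `f1_score` helper are illustrative names, and setting F1 to 0 when nothing is flagged is a common convention rather than part of the formula.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; returns 0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Flagging everyone catches every case, but only 3% of flags are real.
flag_everyone = {"precision": 0.03, "recall": 1.0}
# Flagging no one catches nothing; precision is treated as 0 by convention here.
flag_no_one = {"precision": 0.0, "recall": 0.0}

for name, m in (("flag everyone", flag_everyone), ("flag no one", flag_no_one)):
    arithmetic = (m["precision"] + m["recall"]) / 2
    print(f"{name}: arithmetic mean = {arithmetic:.3f}, F1 = {f1_score(**m):.3f}")
```

For the flag-everyone strategy the arithmetic mean is about 0.52, which looks respectable, while F1 is about 0.06. The harmonic mean is pulled toward the weaker of the two numbers.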

Key concept

What is a risk threshold — and who actually sets it?

The AI model assigns every patient a risk score from 0 to 1. The threshold is the cutoff you choose: patients above it get flagged for further workup, patients below it do not.

In rare disease, this dial is especially consequential. The condition is already rare — so any threshold that is too high will miss almost every real case. But a threshold that is too low will flood the system with false alarms, triggering expensive or invasive follow-up for patients who do not have the disease.

The threshold is not a technical default. It is a clinical policy decision shaped by the severity of a missed diagnosis, the cost and risk of the confirmatory procedure, and how much diagnostic bandwidth your team can absorb.

Low threshold (e.g. 0.2): Flags many patients. Maximizes recall, so few real cases are missed. Generates significant false alarms. Right when the missed diagnosis consequence is severe and the confirmatory test is low-risk.

Balanced threshold (e.g. 0.4–0.5): Balances recall and precision. Appropriate when the confirmatory test carries real risk or cost, and missing some cases is tolerable in the short term.

High threshold (e.g. 0.7): Flags only high-confidence cases. Minimizes false alarms but will miss real patients. Only justifiable when the confirmatory test is highly invasive and the disease progresses slowly.
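The effect of the three threshold settings can be reproduced on synthetic data. The sketch below invents a cohort of 1,000 patients with 35 true cases and draws risk scores from two beta distributions; the distributions, the seed, and the `confusion_at` helper are all assumptions made for illustration and do not come from a real model.

```python
import random

def confusion_at(threshold, scores, labels):
    """Count TP, FP, FN, TN when patients scoring >= threshold are flagged."""
    tp = fp = fn = tn = 0
    for score, has_condition in zip(scores, labels):
        flagged = score >= threshold
        if flagged and has_condition:
            tp += 1
        elif flagged:
            fp += 1
        elif has_condition:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

# Synthetic cohort, invented for illustration: 35 true cases among 1,000 patients.
rng = random.Random(0)
labels = [1] * 35 + [0] * 965
scores = [rng.betavariate(4, 2) if y else rng.betavariate(2, 6) for y in labels]

results = {}
for threshold in (0.2, 0.45, 0.7):
    tp, fp, fn, tn = confusion_at(threshold, scores, labels)
    recall = tp / (tp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    results[threshold] = (tp + fp, recall, precision)
    print(f"threshold {threshold:.2f}: flagged={tp + fp:4d} "
          f"recall={recall:.2f} precision={precision:.2f}")
```

Raising the threshold can only shrink the flagged set, so recall never increases as the threshold goes up. How much precision is gained in return depends on how well the scores separate the two groups.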

Clinical scenario: pulmonary arterial hypertension

Pulmonary arterial hypertension (PAH): 1,000 patients are referred to a cardiology clinic with unexplained dyspnea. Approximately 35 will be confirmed to have PAH. The AI model screens all 1,000 using EHR data (spirometry trends, ECG patterns, BNP levels) and flags those at high risk for confirmatory right heart catheterization. A missed PAH diagnosis (false negative, FN) means progressive right heart failure. A false alarm (false positive, FP) means an unnecessary invasive catheterization.
Prevalence: ~3.5% in a referred dyspnea population · Humbert et al. ESC/ERS Guidelines 2022

Risk threshold: 30 (lower values flag more patients)

AUC: 0.87 (discrimination power; the threshold has no effect)
Recall: 0.89 (share of true cases caught)
Precision (PPV): 0.54 (share of flags that are real)
F1-score: 0.67 (balance of both)

F1 = 2 × (Precision × Recall) / (Precision + Recall) = (2 × 0.89 × 0.54) / (0.89 + 0.54) = 0.9612 / 1.43 ≈ 0.67

Confusion matrix — 1,000 patients, 35 with PAH

What the model does with every patient

	Predicted positive	Predicted negative
Actual positive	31 TP (caught)	4 FN (missed)
Actual negative	26 FP (false alarm)	939 TN (correctly cleared)
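Every headline metric can be recomputed from the four cells of the confusion matrix. The sketch below uses the illustrative counts from the table and adds accuracy for comparison with a baseline that predicts negative for every patient; the variable names are illustrative.

```python
# Counts from the confusion matrix (illustrative widget values).
tp, fn, fp, tn = 31, 4, 26, 939

total = tp + fn + fp + tn
recall = tp / (tp + fn)
precision = tp / (tp + fp)
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / total

# Baseline: predicting negative for everyone gets every true negative right.
baseline_accuracy = (fp + tn) / total

print(f"recall={recall:.2f} precision={precision:.2f} f1={f1:.2f}")  # 0.89 0.54 0.67
print(f"accuracy={accuracy:.3f} vs all-negative baseline={baseline_accuracy:.3f}")  # 0.970 vs 0.965
```

With these counts, accuracy separates the model from the all-negative baseline by only half a percentage point, while F1 separates them by 0.67 versus 0.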

Missed diagnoses (false negatives): 4. Progressive right heart failure; median survival ~2.8 years untreated; each year of diagnostic delay worsens prognosis and treatment response.

False alarms (false positives): 26. Unnecessary right heart catheterization: invasive, ~0.1% serious complication rate, significant patient anxiety and cost.

PAH is a textbook rare disease AI problem. Even in a high-risk referral clinic, only ~3.5% of patients have the disease. Precision will always be hard. The clinical question is: how many invasive catheterizations are you willing to perform to avoid missing one diagnosis?
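One way to put numbers on that question at a single threshold is to count procedures per confirmed case. The sketch below reuses the illustrative confusion-matrix counts and the ~0.1% serious complication rate quoted in this article; it assumes every flagged patient actually undergoes catheterization, which is a simplification.

```python
# Counts from the illustrative PAH confusion matrix in this article.
tp, fn, fp = 31, 4, 26
SERIOUS_COMPLICATION_RATE = 0.001  # ~0.1%, as quoted in the article

# Simplifying assumption: every flagged patient gets the procedure.
catheterizations = tp + fp
per_confirmed_case = catheterizations / tp  # equals 1 / precision
expected_serious_complications = catheterizations * SERIOUS_COMPLICATION_RATE

print(f"{catheterizations} catheterizations, "
      f"{per_confirmed_case:.2f} per confirmed PAH case")  # 57, 1.84
print(f"expected serious complications: {expected_serious_complications:.3f}")  # 0.057
print(f"diagnoses still missed at this threshold: {fn}")  # 4
```

Answering the question fully would require the same counts at neighboring thresholds, to see how many extra procedures each additional caught case costs. The widget values shown here cover only one setting.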

The bottom line

In rare disease, the class imbalance is extreme by definition. A model that achieves 97% accuracy by predicting negative for everyone has failed — it has simply learned the prevalence. F1-score exists precisely to expose this failure. Recall and precision must both be reported, and the threshold must be set by clinicians who understand what a missed diagnosis costs and what an unnecessary confirmatory procedure costs.

AUC tells you how good the model’s underlying discrimination is. Threshold tells you how you choose to deploy it. In rare disease, both questions are harder than they look.

PAH metrics: Humbert et al. ESC/ERS Guidelines 2022 · Gaucher disease diagnostic odyssey: Mistry et al. Am J Hematol 2019 · AKI detection: Tomasev et al. Nature 2019 · Widget values are illustrative for teaching purposes · openovation.co
