When 97% Accurate Means Clinically Useless – What AI Model Scores Actually Mean in Rare Disease


AI in Healthcare · Performance Metrics
Rare diseases expose the hardest truth about AI model scores. When only 3 patients in 100 have the condition, a model that predicts negative for everyone is 97% accurate — and clinically useless. The metrics that matter in rare disease AI are not the ones most people report first.

The four metrics, plainly stated

AUC (Area Under the Curve) measures how well the model ranks patients by risk. An AUC of 0.87 means: if you randomly pick one patient who has the condition and one who does not, the model assigns the higher risk score to the patient with the condition 87 times out of 100. It is not affected by the threshold you choose.

Recall (Sensitivity) is the percentage of true cases the model catches. In rare disease, this determines whether patients escape a 5–10 year diagnostic odyssey or are identified early.

Precision (Positive Predictive Value) is the percentage of flagged patients who actually have the condition. In rare disease, precision is almost always the harder metric — real cases are scarce, so false alarms accumulate quickly.

F1-score is the harmonic mean of recall and precision: F1 = 2 × (Precision × Recall) / (Precision + Recall). It cannot be gamed by a model that predicts negative for everyone. Both metrics must be strong for F1 to be strong.
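The three threshold-dependent definitions above can be written down in a few lines. This is a minimal sketch in Python; the function name and the toy counts are illustrative, not taken from any real study. AUC is omitted because it needs the full ranking of risk scores, not a single set of confusion-matrix counts.

```python
# Sketch: recall, precision, F1, and accuracy from confusion-matrix counts.

def metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    recall = tp / (tp + fn) if tp + fn else 0.0        # % of true cases caught
    precision = tp / (tp + fp) if tp + fp else 0.0     # % of flags that are real
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)              # harmonic mean of the two
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"recall": recall, "precision": precision,
            "f1": f1, "accuracy": accuracy}

# Toy counts (illustrative only): 10 true cases, 7 caught, 5 false alarms.
print(metrics(tp=7, fp=5, fn=3, tn=85))
```

Note that a model flagging nobody would make every numerator zero: accuracy stays high, but recall, precision, and F1 all collapse to zero.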

Key concept

What is a risk threshold — and who actually sets it?

The AI model assigns every patient a risk score from 0 to 1. The threshold is the cutoff you choose: patients above it get flagged for workup, patients below it do not.

In rare disease, a threshold too high will miss almost every real case. A threshold too low floods the system with false alarms, triggering invasive follow-up for patients who do not have the disease. The threshold is not a technical default — it is a clinical policy decision.

Low threshold (e.g. 0.2): Flags many patients. Maximizes recall. Generates significant false alarms. Right when a missed diagnosis carries severe consequences and the confirmatory test is low-risk.
Balanced (e.g. 0.4–0.5): Balances recall and precision. Appropriate when the confirmatory test carries real cost or risk.
High threshold (e.g. 0.7): Flags only high-confidence cases. Minimizes false alarms but misses real patients. Only justifiable when the confirmatory test is highly invasive.
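The trade-off in that list can be demonstrated by sweeping a cutoff over risk scores. The scores and labels below are made up for illustration; a real model would produce one score per patient.

```python
# Sketch: moving the threshold trades recall against precision.

def confusion_at(scores, labels, threshold):
    """Count TP, FP, FN for a given cutoff (labels: 1 = has condition)."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    return tp, fp, fn

scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]  # illustrative
labels = [1,   1,   0,   1,   0,   0,   1,   0,   0,   0]

for t in (0.2, 0.45, 0.7):
    tp, fp, fn = confusion_at(scores, labels, t)
    recall = tp / (tp + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    print(f"threshold {t:.2f}: recall {recall:.2f}, precision {precision:.2f}")
```

Even on ten fabricated patients the pattern holds: the low threshold catches every case at the cost of precision, and the high threshold does the reverse.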

Worked example — PAH: 1,000 patients, 35 with the condition, threshold set at 0.40

AUC: 0.87 (discrimination power; the threshold has no effect on it)
Recall: 0.74 (% of true cases caught)
Precision (PPV): 0.75 (% of flags that are real)
F1-score: 0.74 (balance of both)

F1 = 2 × (Precision × Recall) / (Precision + Recall) = (2 × 0.74 × 0.75) / (0.74 + 0.75) = 1.11 / 1.49 = 0.74

Confusion matrix — PAH example (35 with the condition, 965 without)

What the model does with every patient:

                     Predicted positive        Predicted negative
Actual positive      26 (TP — caught)          9 (FN — missed)
Actual negative      9 (FP — false alarm)      956 (TN — correctly clear)

Missed diagnoses (9 FN): progressive right heart failure; each year of diagnostic delay worsens prognosis.
False alarms (9 FP): unnecessary right heart catheterization — invasive, ~0.1% serious complication rate.

Lower the threshold and more patients are caught, at the cost of more false alarms. Raise it and false alarms fall, but more diagnoses are missed. Neither direction is universally correct — context decides.
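The PAH confusion matrix can be checked directly against the reported metrics. The counts below are the ones from the table above; everything else is arithmetic.

```python
# Checking the PAH confusion matrix: 26 TP, 9 FN, 9 FP, 956 TN.
tp, fn, fp, tn = 26, 9, 9, 956

recall = tp / (tp + fn)                      # 26/35
precision = tp / (tp + fp)                   # 26/35
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + fn + fp + tn)   # 982/1000

print(f"recall={recall:.2f}  precision={precision:.2f}  "
      f"f1={f1:.2f}  accuracy={accuracy:.3f}")
# recall=0.74  precision=0.74  f1=0.74  accuracy=0.982
```

Two things are worth noticing: accuracy is 98.2% even though 9 patients with PAH were missed, which is exactly why accuracy misleads at low prevalence; and precision here is 26/35 ≈ 0.74, which a dashboard may display as 0.75 after rounding.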

Three clinical scenarios — same metrics, different stakes

Pulmonary arterial hypertension (PAH)
1,000 patients referred with unexplained dyspnea · ~35 have PAH (prevalence 3.5%)

The AI screens EHR data — spirometry trends, ECG patterns, BNP levels — and flags those at high risk for confirmatory right heart catheterization.

Missed diagnosis (FN)
Progressive right heart failure. Median survival ~2.8 years untreated. Each year of diagnostic delay worsens prognosis and treatment response.
False alarm (FP)
Unnecessary right heart catheterization — invasive, ~0.1% serious complication rate, significant patient anxiety and cost.

How many invasive catheterizations are you willing to perform to avoid missing one diagnosis? That answer belongs to the clinician, not the algorithm.

Gaucher disease
1,000 patients referred with splenomegaly, thrombocytopenia, or bone pain · ~15 have Gaucher disease (prevalence 1.5%)

The AI screens EHR patterns and flags candidates for confirmatory enzyme activity testing. A model predicting negative for everyone would be 98.5% accurate — and have an F1 near zero.
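The all-negative baseline is easy to verify at this prevalence. The counts below follow directly from the cohort described above: 15 cases in 1,000, none flagged.

```python
# The "always predict negative" baseline at Gaucher prevalence (15 / 1,000).
tp, fp, fn, tn = 0, 0, 15, 985

accuracy = (tp + tn) / 1000                  # looks excellent
recall = tp / (tp + fn)                      # every case missed
f1 = (2 * tp) / (2 * tp + fp + fn)           # equivalent F1 form; avoids 0/0

print(accuracy, recall, f1)   # 0.985 0.0 0.0
```

The `2·TP / (2·TP + FP + FN)` form of F1 is algebraically identical to the harmonic-mean formula and sidesteps the division by zero that the degenerate model would otherwise trigger.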

Missed diagnosis (FN)
Progressive splenomegaly, bone marrow infiltration, irreversible bone damage. Average diagnostic odyssey without early identification: 5–10 years.
False alarm (FP)
Unnecessary enzyme activity testing, genetic counseling referral, family anxiety — and reduced clinician trust in AI alerts over time.

This is the extreme version of the rare disease AI problem. F1-score is the only metric that honestly reflects a 1.5% prevalence reality.

Acute kidney injury (AKI)
1,000 inpatients · ~60 will develop AKI within 48 hours (prevalence 6%)

The AI screens labs, medications, and vitals and flags patients for early nephrology review. The 48-hour prediction window is tight — action must follow the flag quickly.

Missed case (FN)
Progressive renal injury — risk of dialysis, prolonged ICU stay, permanent renal damage if nephrotoxic medications are not adjusted in time.
False alarm (FP)
Unnecessary nephrology consults, held medications, delayed procedures — and alert fatigue eroding clinician response to future flags.

Alert fatigue from too many false positives is itself a patient safety issue. The threshold must be set with your team’s alert load in mind.

The bottom line

In rare disease, the class imbalance is extreme by definition. A model that achieves 97% accuracy by predicting negative for everyone has failed — it has simply learned the prevalence. F1-score exists precisely to expose this failure.

Recall and precision must both be reported. The threshold must be set by clinicians who understand what a missed diagnosis costs and what an unnecessary confirmatory procedure costs. The algorithm does not know your patient. You do.

AUC tells you how good the model’s discrimination is. Threshold tells you how you choose to deploy it. In rare disease, both questions are harder than they look.