When 97% Accurate Means Clinically Useless – What AI Model Scores Actually Mean in Rare Disease


AI in Healthcare · Performance Metrics
Rare diseases expose the hardest truth about AI model scores. When only 3 patients in 100 have the condition, a model that predicts negative for everyone is 97% accurate — and clinically useless. The metrics that matter in rare disease AI are not the ones most people report first.

The four metrics, plainly stated

AUC (Area Under the Curve) measures how well the model ranks patients by risk. An AUC of 0.87 means: if you randomly pick one patient who has the condition and one who does not, the model assigns the higher risk score to the patient with the condition 87 times out of 100. It is not affected by the threshold you choose.

Recall (Sensitivity) is the percentage of true cases the model catches. In rare disease, this determines whether patients escape a 5–10 year diagnostic odyssey or are identified early.

Precision (Positive Predictive Value) is the percentage of flagged patients who actually have the condition. In rare disease, precision is almost always the harder metric — real cases are scarce, so false alarms accumulate quickly.

F1-score is the harmonic mean of recall and precision: F1 = 2 × (Precision × Recall) / (Precision + Recall). It cannot be gamed by a model that predicts negative for everyone. Both metrics must be strong for F1 to be strong.
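The three threshold-dependent definitions above can be written down in a few lines. This is a minimal sketch in Python; the function name and the toy counts are illustrative, not taken from any real study. AUC is omitted because it needs the full ranking of risk scores, not a single set of confusion-matrix counts.

```python
# Sketch: recall, precision, F1, and accuracy from confusion-matrix counts.

def metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    recall = tp / (tp + fn) if tp + fn else 0.0        # % of true cases caught
    precision = tp / (tp + fp) if tp + fp else 0.0     # % of flags that are real
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)              # harmonic mean of the two
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"recall": recall, "precision": precision,
            "f1": f1, "accuracy": accuracy}

# Toy counts (illustrative only): 10 true cases, 7 caught, 5 false alarms.
print(metrics(tp=7, fp=5, fn=3, tn=85))
```

Note that a model flagging nobody would make every numerator zero: accuracy stays high, but recall, precision, and F1 all collapse to zero.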

Key concept

What is a risk threshold — and who actually sets it?

The AI model assigns every patient a risk score from 0 to 1. The threshold is the cutoff you choose: patients above it get flagged for workup, patients below it do not.

In rare disease, a threshold too high will miss almost every real case. A threshold too low floods the system with false alarms, triggering invasive follow-up for patients who do not have the disease. The threshold is not a technical default — it is a clinical policy decision.

Low threshold (e.g. 0.2): Flags many patients. Maximizes recall. Generates significant false alarms. Right when a missed diagnosis carries severe consequences and the confirmatory test is low-risk.
Balanced (e.g. 0.4–0.5): Balances recall and precision. Appropriate when the confirmatory test carries real cost or risk.
High threshold (e.g. 0.7): Flags only high-confidence cases. Minimizes false alarms but misses real patients. Only justifiable when the confirmatory test is highly invasive.
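The trade-off in that list can be demonstrated by sweeping a cutoff over risk scores. The scores and labels below are made up for illustration; a real model would produce one score per patient.

```python
# Sketch: moving the threshold trades recall against precision.

def confusion_at(scores, labels, threshold):
    """Count TP, FP, FN for a given cutoff (labels: 1 = has condition)."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    return tp, fp, fn

scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]  # illustrative
labels = [1,   1,   0,   1,   0,   0,   1,   0,   0,   0]

for t in (0.2, 0.45, 0.7):
    tp, fp, fn = confusion_at(scores, labels, t)
    recall = tp / (tp + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    print(f"threshold {t:.2f}: recall {recall:.2f}, precision {precision:.2f}")
```

Even on ten fabricated patients the pattern holds: the low threshold catches every case at the cost of precision, and the high threshold does the reverse.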

Worked example — PAH: 1,000 patients, 35 with the condition, threshold set at 0.40

AUC: 0.87 (discrimination power; the threshold has no effect on it)
Recall: 0.74 (% of true cases caught)
Precision (PPV): 0.75 (% of flags that are real)
F1-score: 0.74 (balance of both)

F1 = 2 × (Precision × Recall) / (Precision + Recall) = (2 × 0.74 × 0.75) / (0.74 + 0.75) = 1.11 / 1.49 = 0.74

Confusion matrix — PAH example (35 with the condition, 965 without)

What the model does with every patient:

                     Predicted positive        Predicted negative
Actual positive      26 (TP — caught)          9 (FN — missed)
Actual negative      9 (FP — false alarm)      956 (TN — correctly clear)

Missed diagnoses (9 FN): progressive right heart failure; each year of diagnostic delay worsens prognosis.
False alarms (9 FP): unnecessary right heart catheterization — invasive, ~0.1% serious complication rate.

Lower the threshold and more patients are caught, at the cost of more false alarms. Raise it and false alarms fall, but more diagnoses are missed. Neither direction is universally correct — context decides.
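The PAH confusion matrix can be checked directly against the reported metrics. The counts below are the ones from the table above; everything else is arithmetic.

```python
# Checking the PAH confusion matrix: 26 TP, 9 FN, 9 FP, 956 TN.
tp, fn, fp, tn = 26, 9, 9, 956

recall = tp / (tp + fn)                      # 26/35
precision = tp / (tp + fp)                   # 26/35
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + fn + fp + tn)   # 982/1000

print(f"recall={recall:.2f}  precision={precision:.2f}  "
      f"f1={f1:.2f}  accuracy={accuracy:.3f}")
# recall=0.74  precision=0.74  f1=0.74  accuracy=0.982
```

Two things are worth noticing: accuracy is 98.2% even though 9 patients with PAH were missed, which is exactly why accuracy misleads at low prevalence; and precision here is 26/35 ≈ 0.74, which a dashboard may display as 0.75 after rounding.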

Three clinical scenarios — same metrics, different stakes

Pulmonary arterial hypertension (PAH)
1,000 patients referred with unexplained dyspnea · ~35 have PAH (prevalence 3.5%)

The AI screens EHR data — spirometry trends, ECG patterns, BNP levels — and flags those at high risk for confirmatory right heart catheterization.

Missed diagnosis (FN)
Progressive right heart failure. Median survival ~2.8 years untreated. Each year of diagnostic delay worsens prognosis and treatment response.
False alarm (FP)
Unnecessary right heart catheterization — invasive, ~0.1% serious complication rate, significant patient anxiety and cost.

How many invasive catheterizations are you willing to perform to avoid missing one diagnosis? That answer belongs to the clinician, not the algorithm.

Gaucher disease
1,000 patients referred with splenomegaly, thrombocytopenia, or bone pain · ~15 have Gaucher disease (prevalence 1.5%)

The AI screens EHR patterns and flags candidates for confirmatory enzyme activity testing. A model predicting negative for everyone would be 98.5% accurate — and have an F1 near zero.
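The all-negative baseline is easy to verify at this prevalence. The counts below follow directly from the cohort described above: 15 cases in 1,000, none flagged.

```python
# The "always predict negative" baseline at Gaucher prevalence (15 / 1,000).
tp, fp, fn, tn = 0, 0, 15, 985

accuracy = (tp + tn) / 1000                  # looks excellent
recall = tp / (tp + fn)                      # every case missed
f1 = (2 * tp) / (2 * tp + fp + fn)           # equivalent F1 form; avoids 0/0

print(accuracy, recall, f1)   # 0.985 0.0 0.0
```

The `2·TP / (2·TP + FP + FN)` form of F1 is algebraically identical to the harmonic-mean formula and sidesteps the division by zero that the degenerate model would otherwise trigger.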

Missed diagnosis (FN)
Progressive splenomegaly, bone marrow infiltration, irreversible bone damage. Average diagnostic odyssey without early identification: 5–10 years.
False alarm (FP)
Unnecessary enzyme activity testing, genetic counseling referral, family anxiety — and reduced clinician trust in AI alerts over time.

This is the extreme version of the rare disease AI problem. F1-score is the only metric that honestly reflects a 1.5% prevalence reality.

Acute kidney injury (AKI)
1,000 inpatients · ~60 will develop AKI within 48 hours (prevalence 6%)

The AI screens labs, medications, and vitals and flags patients for early nephrology review. The 48-hour prediction window is tight — action must follow the flag quickly.

Missed case (FN)
Progressive renal injury — risk of dialysis, prolonged ICU stay, permanent renal damage if nephrotoxic medications are not adjusted in time.
False alarm (FP)
Unnecessary nephrology consults, held medications, delayed procedures — and alert fatigue eroding clinician response to future flags.

Alert fatigue from too many false positives is itself a patient safety issue. The threshold must be set with your team’s alert load in mind.

The bottom line

In rare disease, the class imbalance is extreme by definition. A model that achieves 97% accuracy by predicting negative for everyone has failed — it has simply learned the prevalence. F1-score exists precisely to expose this failure.

Recall and precision must both be reported. The threshold must be set by clinicians who understand what a missed diagnosis costs and what an unnecessary confirmatory procedure costs. The algorithm does not know your patient. You do.

AUC tells you how good the model’s discrimination is. Threshold tells you how you choose to deploy it. In rare disease, both questions are harder than they look.