AI in Healthcare · Performance Metrics
Rare diseases expose the hardest truth about AI model scores. When only 3 patients in 100 have the condition, a model that predicts negative for everyone is 97% accurate — and clinically useless. The metrics that matter in rare disease AI are not the ones most people report first.
The four metrics, plainly stated
AUC (Area Under the Curve) measures how well the model ranks patients by risk. An AUC of 0.87 means: if you randomly pick one patient who has the condition and one who does not, the model correctly identifies the higher-risk patient 87 times out of 100. It is not affected by the threshold you choose.
Recall (Sensitivity) is the percentage of true cases the model catches. In rare disease, this determines whether patients escape a 5–10 year diagnostic odyssey or are identified early.
Precision (Positive Predictive Value) is the percentage of flagged patients who actually have the condition. In rare disease, precision is almost always the harder metric — real cases are scarce, so false alarms accumulate quickly.
F1-score is the harmonic mean of recall and precision: F1 = 2 × (Precision × Recall) / (Precision + Recall). It cannot be gamed by a model that predicts negative for everyone. Both metrics must be strong for F1 to be strong.
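The four definitions above can be sketched in a few lines of Python. The TP/FP/FN counts below match the PAH confusion matrix shown later in the article; the score lists passed to the AUC function are purely illustrative.

```python
import itertools

def recall(tp, fn):
    # Sensitivity: share of true cases the model catches
    return tp / (tp + fn)

def precision(tp, fp):
    # Positive predictive value: share of flagged patients who are real cases
    return tp / (tp + fp)

def f1(tp, fp, fn):
    # Harmonic mean of precision and recall
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r)

def auc(pos_scores, neg_scores):
    # Pairwise ranking probability: fraction of (case, non-case) pairs
    # where the case gets the higher risk score; ties count half.
    wins = sum((p > n) + 0.5 * (p == n)
               for p, n in itertools.product(pos_scores, neg_scores))
    return wins / (len(pos_scores) * len(neg_scores))

# Counts from the PAH confusion matrix later in the article:
print(round(recall(26, 9), 2))     # 0.74
print(round(precision(26, 9), 2))  # 0.74
print(round(f1(26, 9, 9), 2))      # 0.74

# Tiny hypothetical score lists, just to show the ranking interpretation:
print(auc([0.9, 0.7, 0.4], [0.6, 0.3, 0.2, 0.1]))  # 11/12, about 0.92
```

Note that with equal precision and recall, F1 equals both of them; it only drops below when the two diverge.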
What is a risk threshold — and who actually sets it?
The AI model assigns every patient a risk score from 0 to 1. The threshold is the cutoff you choose: patients above it get flagged for workup, patients below it do not.
In rare disease, a threshold too high will miss almost every real case. A threshold too low floods the system with false alarms, triggering invasive follow-up for patients who do not have the disease. The threshold is not a technical default — it is a clinical policy decision.
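A minimal sketch of that cutoff in code; the five risk scores here are made-up values for illustration.

```python
# Hypothetical risk scores for five patients (0 = low risk, 1 = high risk)
scores = [0.05, 0.22, 0.41, 0.68, 0.93]

def flag(scores, threshold):
    # Patients at or above the threshold are flagged for workup
    return [s >= threshold for s in scores]

print(flag(scores, 0.40))  # lower threshold: 3 of 5 flagged
print(flag(scores, 0.70))  # higher threshold: only 1 of 5 flagged
```

The model's scores do not change between the two calls; only the policy applied to them does.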
Moving the threshold changes the metrics (PAH: 1,000 patients, 35 with the condition; values shown at threshold 0.40)

| Metric | Value | Meaning |
| --- | --- | --- |
| AUC | 0.87 | Discrimination power (threshold has no effect) |
| Recall | 0.74 | % of true cases caught |
| Precision | 0.74 | % of flags that are real |
| F1 | 0.74 | Balance of both |
Confusion matrix — PAH example (35 with the condition, 965 without)

| | Predicted positive | Predicted negative |
| --- | --- | --- |
| Actual positive | 26 TP — caught | 9 FN — missed |
| Actual negative | 9 FP — false alarm | 956 TN — correctly clear |

The 9 FN (missed): progressive right heart failure, where each year of diagnostic delay worsens prognosis.
The 9 FP (false alarm): unnecessary right heart catheterization — invasive, ~0.1% serious complication rate.
Lower the threshold: more patients caught, more false alarms. Raise it: fewer false alarms, more missed diagnoses. Neither direction is universally correct — context decides.
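That tradeoff can be sketched with simulated scores. The two Gaussian score distributions below are assumptions chosen for illustration, not the article's model; only the prevalence (35 cases in 1,000) matches the PAH example.

```python
import random

random.seed(0)
# Simulate 1,000 patients at PAH-like prevalence: 35 cases, 965 controls.
# Cases tend to score higher; the overlap between the two groups is what
# forces the recall/precision tradeoff.
cases    = [min(1.0, max(0.0, random.gauss(0.65, 0.15))) for _ in range(35)]
controls = [min(1.0, max(0.0, random.gauss(0.30, 0.15))) for _ in range(965)]

for t in (0.3, 0.5, 0.7):
    tp = sum(s >= t for s in cases)      # true cases flagged
    fp = sum(s >= t for s in controls)   # false alarms
    rec = tp / 35
    prec = tp / (tp + fp) if tp + fp else 0.0
    print(f"threshold {t:.1f}: recall {rec:.2f}, precision {prec:.2f}")
```

As the threshold rises, recall can only fall while precision tends to rise, which is exactly the slider's behavior.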
Three clinical scenarios — same metrics, different stakes
1,000 patients referred with unexplained dyspnea · ~35 have PAH (prevalence 3.5%)
The AI screens EHR data — spirometry trends, ECG patterns, BNP levels — and flags those at high risk for confirmatory right heart catheterization.
Cost of a missed case (FN): progressive right heart failure. Median survival ~2.8 years untreated. Each year of diagnostic delay worsens prognosis and treatment response.
Cost of a false alarm (FP): unnecessary right heart catheterization — invasive, ~0.1% serious complication rate, significant patient anxiety and cost.
1,000 patients referred with splenomegaly, thrombocytopenia, or bone pain · ~15 have Gaucher disease (prevalence 1.5%)
The AI screens EHR patterns and flags candidates for confirmatory enzyme activity testing. A model predicting negative for everyone would be 98.5% accurate — and have an F1 near zero.
Cost of a missed case (FN): progressive splenomegaly, bone marrow infiltration, irreversible bone damage. Average diagnostic odyssey without early identification: 5–10 years.
Cost of a false alarm (FP): unnecessary enzyme activity testing, genetic counseling referral, family anxiety — and reduced clinician trust in AI alerts over time.
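The all-negative baseline from the Gaucher scenario can be checked directly, using the counts stated above (15 true cases in 1,000 patients).

```python
# Gaucher scenario: 1,000 patients, 15 true cases, 985 without the disease.
# An "always negative" model: tp = 0, fp = 0, fn = 15, tn = 985.
tp, fp, fn, tn = 0, 0, 15, 985

accuracy = (tp + tn) / (tp + fp + fn + tn)
# With no positive predictions, recall is 0 and precision is undefined,
# so F1 is conventionally reported as 0.
f1 = 0.0 if tp == 0 else 2 * tp / (2 * tp + fp + fn)

print(accuracy)  # 0.985
print(f1)        # 0.0
```

Accuracy rewards the model for the 985 easy negatives; F1 ignores them entirely, which is why it exposes this failure.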
1,000 inpatients · ~60 will develop AKI within 48 hours (prevalence 6%)
The AI screens labs, medications, and vitals and flags patients for early nephrology review. The 48-hour prediction window is tight — action must follow the flag quickly.
Cost of a missed case (FN): progressive renal injury — risk of dialysis, prolonged ICU stay, permanent renal damage if nephrotoxic medications are not adjusted in time.
Cost of a false alarm (FP): unnecessary nephrology consults, held medications, delayed procedures — and alert fatigue eroding clinician response to future flags.
The bottom line
In rare disease, the class imbalance is extreme by definition. A model that achieves 97% accuracy by predicting negative for everyone has failed — it has simply learned the prevalence. F1-score exists precisely to expose this failure.
Recall and precision must both be reported. The threshold must be set by clinicians who understand what a missed diagnosis costs and what an unnecessary confirmatory procedure costs. The algorithm does not know your patient. You do.
AUC tells you how good the model’s discrimination is. Threshold tells you how you choose to deploy it. In rare disease, both questions are harder than they look.