Grading Dimensions

Each benchmark is graded across six dimensions, with two further descriptive columns. The full rationale behind each grade follows the corresponding table row. Dimensions marked ★ receive A–F grades; others are descriptive.

★ Task Design
Does the eval task reflect real clinical work? Are inputs, outputs, and metrics aligned, and is coverage matched to the claims?
★ Data, Labels & Leakage
Real clinical data vs. synthetic? Quality of reference labels, contamination controls, transparency, and scale.
★ Model-Use Fidelity
Are models, versions, and prompts reported? Is the harness fair across labs, with appropriate baselines and comparators?
★ Scoring Rigor
Reliability of the scoring mechanism. Judge/rubric quality, calibration, uncertainty handling, and error analysis.
★ Robustness & Generalizability
Does the benchmark hold up under scrutiny? Sensitivity checks, generalization, edge cases, replication.
★ Clinical Validity & Safety
Does it correlate with real clinical performance? Clinician comparator, harm/safety analysis, honest deployment framing.
Output Measured
What the model must produce. Descriptive, not graded.
No grade assigned
Results Framing
How performance is reported and contextualised. Descriptive, not graded.
No grade assigned
Grades: A = Exemplary, B = Strong, C = Adequate, D = Weak, F = Insufficient

Current Evals

11 benchmarks

Benchmarks that evaluate the latest 1–2 generations of frontier models (roughly GPT-4.1 / o3 / Claude 3.7+ era and beyond). Given how rapidly model capabilities are advancing, only these evaluations carry meaningful signal for understanding where frontier AI stands today.

Benchmark Models Tested Task Design Data & Leakage Model-Use Fidelity Scoring Rigor Robustness Clinical Validity Output Measured Results Framing Overall
2026 · OpenAI
COI: Developer-published
GPT-5.4, Claude Opus 4.7, Gemini 3.1 Pro, Grok 4.20 A A B B A A Free-text responses to clinician chats (consult, docs, research) 0–100 composite vs. competing models & physician responses B+
Conflict of Interest OpenAI built the benchmark, OpenAI's own product performs best, and the grader is also an OpenAI model (GPT-5.4 at low reasoning). GPT-5.4 is tested in three configurations (base, with browsing, and inside ChatGPT for Clinicians), so the contribution of the product wrapper is at least visible rather than hidden. The cross-lab harness gap is a larger problem: Claude Opus 4.7, Gemini 3.1 Pro, and Grok 4.20 are evaluated through base API only, with no comparable product harness or retrieval over peer-reviewed literature.
Task Design A
HealthBench Professional does well on task design because it evaluates the kinds of requests clinicians might actually bring to a model: care consults, documentation help, and medical research. The task is easy to interpret (the model has to write the next response in a clinical chat) and the scoring is well matched to that goal, with physician-written rubrics that reward useful content and penalize unsafe or undesirable responses. The length adjustment is also a strength, since it accounts for the tendency of longer answers to score better. Overall, the benchmark has appropriate coverage for its stated scope, with 525 examples across 28 specialties, and the paper is careful not to present the results as proof of real-world deployment readiness.
Data, Labels & Leakage A
Examples are physician-authored, reviewed, and adjudicated, with difficult cases requiring independent confirmation of both model error and scenario realism. The main limitation is data realism. These are clinically grounded chat tasks, but not clearly derived from real patient records or prior clinical encounters, so they sit between synthetic vignettes and true clinical data. Leakage controls are thoughtful, including a canary string and private held-out set, although the public static release will still be vulnerable to future contamination.
Model-Use Fidelity B
The paper reports evaluated systems clearly, including model families, reasoning settings, verbosity defaults, system message handling, and the number of samples per example. Setup is consistent across conditions, with each model producing the next response to the same clinician chat input and physician responses scored through the same rubric pipeline. The main caveat is harness fairness. ChatGPT for Clinicians includes retrieval over peer reviewed literature while competitor models are evaluated through base API only. The paper acknowledges this and partially disambiguates the harness contribution by running GPT-5.4 in three configurations (base, with browsing, in ChatGPT for Clinicians), and a fair base API comparison is embedded in the results showing base GPT-5.4 at 48.1 still leading Claude Opus 4.7 at 47.0, Gemini 3.1 Pro at 43.8, and Grok 4.20 at 36.1.
Scoring Rigor B
Scoring rigor is strong overall. The benchmark uses physician-written, case-specific rubrics with positive and negative point values, and the main results are based on eight samples per example with confidence intervals, paired tests, and Holm correction. Interpretability is supported by subgroup analyses across use case, dataset slice, specialty, harness condition, verbosity, and reasoning effort.
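The Holm correction mentioned above controls the family-wise error rate when several paired comparisons are tested at once. The sketch below shows the standard step-down adjustment on a made-up list of p-values; the function name `holm_adjust` and the example values are illustrative assumptions, not taken from the paper.

```python
def holm_adjust(p_values):
    """Return Holm step-down adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # Scale by the number of hypotheses not yet processed, capped at 1.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity so adjusted values never decrease with rank.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


# Hypothetical raw p-values from three pairwise model comparisons.
raw = [0.01, 0.04, 0.03]
adj = holm_adjust(raw)
significant = [p <= 0.05 for p in adj]
```

In this toy case only the first comparison stays below 0.05 after adjustment, which is the kind of shrinkage a reader should expect when many model pairs are compared on the same benchmark.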
Robustness & Generalizability A
The paper includes several sensitivity checks, including length adjustment, verbosity sweeps, reasoning-effort analyses, eight samples per example, and performance breakdowns by use case, specialty, and dataset slice. Its adversarial component is also substantial, with roughly 36% of examples coming from red-teaming cases designed to surface failure modes. The main limitation is generalizability. Despite broad specialty coverage, this remains a curated clinician-chat benchmark rather than a multi-site deployment or institution-specific workflow evaluation.
Clinical Validity & Safety A
Clinical validity is strong because the benchmark is anchored to specialty matched physician responses, written with unbounded time and full reference access, and scored against physician written rubrics. Safety is addressed through negative rubric criteria and red teaming cases that test unsafe assumptions, uncertainty handling, and potentially harmful conclusions. The main limitation is that harm is measured implicitly through example level rubric criteria rather than through an explicit harm taxonomy, severity grading, or omission analysis of the kind benchmarks like NOHARM produce. Deployment framing is appropriately cautious, with the paper explicitly noting that benchmark scores are not real world performance rates and that institutions should run their own pre and post deployment evaluation before adopting these tools.
Bottom Line
Key Finding
GPT-5.4 (Clinicians): 59.0
GPT-5.4 in ChatGPT for Clinicians scored 59.0, ahead of base GPT-5.4 (48.1), Claude Opus 4.7 (47.0), Gemini 3.1 Pro (43.8), Grok 4.20 (36.1), and specialty matched physician responses (43.7).
Strengths
Real clinician chat tasks across 28 specialties and 52 languages. 525 examples selected from 15,079 candidates, with two independent physicians required to verify any difficult case. Specialty matched physician baseline with unbounded time and web access. About 36 percent of the benchmark is dedicated adversarial red teaming. Length adjusted scoring with an empirically derived coefficient. Eight samples per example with Holm corrected statistics. Contamination controls include a canary string and a private held out set.
Limitations
OpenAI-authored benchmark; evaluates OpenAI’s own product; GPT-5.4 used as grader; competitor models did not receive equivalent product harness/retrieval support; scores reflect enriched difficult cases, not average real-world performance.
2025 · OpenAI
COI: Developer-published
o3, GPT-4.1, Claude 3.7, Gemini 2.5 Pro B C A A B A Free-text responses to 5,000 health conversations 0–1 aggregate; HealthBench Hard & Consensus subsets C+
Conflict of Interest OpenAI publishes and self-administers the benchmark. The top-performing model is OpenAI's own o3, and the grader is also an OpenAI model (GPT-4.1). The harness is minimal and identical across labs, with no retrieval, tools, or scaffolding for any model, so there is no cross-lab harness asymmetry. The meta-evaluation of the grader against physicians is the main defense against the same-provider grading concern.
Task Design B
Task design is strong for an open ended health benchmark. Models produce the next response in single or multi turn health conversations and are scored against physician written, conversation specific rubrics covering accuracy, completeness, communication quality, context awareness, and instruction following, with negative criteria penalizing unsafe behavior. Coverage spans 5,000 examples across seven themes and five behavioral axes. The score decreases slightly because conversations are mostly synthetic, the benchmark blends layperson and clinician users into one aggregate, and the paper itself notes it does not evaluate performance at the level of specific clinical workflows.
Data, Labels & Leakage C
HealthBench is strong on scale, transparency, and physician annotation, but weaker on data realism. Conversations are mostly synthetic, supplemented by physician red teaming and rewrites of HealthSearchQA, so the dataset reads as realistic simulation rather than real clinical data. Physicians wrote the rubrics and the 34 consensus criteria were multi physician validated, but most example specific criteria had only a single physician reviewer. Leakage controls (canary string, no reproduce request, private held out set) are thoughtful, though the public release remains exposed to future contamination.
Model-Use Fidelity A
The paper reports model identifiers, sampling parameters, reasoning effort sweeps, and grader configuration. The harness is minimal and identical across labs. Every model receives the same conversation with no retrieval, tools, or scaffolding, which gives a clean cross provider comparison at the cost of not testing real deployment conditions. Comparators are appropriate at the time the paper was written, covering current frontier models from OpenAI, Anthropic, Google, xAI, and Meta, plus a multi generation OpenAI trajectory and three physician baseline conditions.
Scoring Rigor A
Scoring rigor is strong. Physician written, conversation specific rubrics with positive and negative point values, graded by GPT-4.1. The meta evaluation is particularly robust. Across 60,896 physician meta examples, GPT-4.1 macro F1 exceeded average physician agreement in five of seven themes. Uncertainty work covers CIs, a 16 run variability analysis, and worst at k reliability curves. Section 6.2 names completeness and context awareness as the weakest axes across all models, which is the kind of failure mode characterization most benchmarks skip. The grader is OpenAI's own model on OpenAI's own benchmark, though the meta evaluation against physicians is the strongest defense available against that concern.
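To make the idea of rubrics with positive and negative point values concrete, here is a minimal sketch of one plausible way to turn graded criteria into a 0–1 score. The normalization by total positive points and the clipping to [0, 1] are assumptions made for illustration, and `rubric_score` is a hypothetical helper; the text above does not specify the exact formula.

```python
def rubric_score(criteria):
    """Score one response from (points, met) pairs; points may be negative.

    Assumed formula: points earned on met criteria, divided by the total
    achievable positive points, clipped to the range [0, 1].
    """
    max_points = sum(points for points, _ in criteria if points > 0)
    if max_points == 0:
        raise ValueError("rubric needs at least one positive criterion")
    earned = sum(points for points, met in criteria if met)
    return min(1.0, max(0.0, earned / max_points))


# Hypothetical rubric: three desirable criteria and one unsafe behavior.
example = [
    (5, True),    # recommends urgent evaluation
    (3, False),   # asks about relevant medication history
    (-4, True),   # gives a specific dose without enough context
    (2, True),    # uses plain language
]
score = rubric_score(example)
```

Under these assumptions a response that meets useful criteria can still lose most of its credit by triggering a negative criterion, which is how safety can enter the score without a separate harm metric.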
Robustness & Generalizability B
Robustness is partially addressed. The paper runs a 16 run variability analysis, worst at k reliability curves, reasoning effort sweeps, and length controlled win rates, but skips prompt sensitivity testing and uses a single grader without cross provider validation. Generalization is reported by theme and axis but not by language or specialty, which is a gap given that the physician cohort spans 49 languages and 26 specialties. Edge case work is strong. HealthBench Hard isolates the 1,000 hardest examples for headroom, roughly a third of the dataset comes from physician red teaming, and the length controlled win rates directly address the verbosity shortcut.
Clinical Validity & Safety A
Clinical validity is strong. The physician comparator uses three baseline conditions with internet access, no time limit, and instructions to write the response physicians would most want a safe and helpful AI to give. Safety is addressed thematically through the emergency referrals, responding under uncertainty, and context seeking themes, but there is no harm taxonomy, severity grading, or subgroup safety analysis. Deployment framing is appropriately bounded. The paper states that HealthBench does not evaluate specific clinical workflows or measure health outcomes, that the de novo physician baseline is limited because writing chat responses is not a standard physician task, and that real world workflow studies are needed.
Bottom Line
Key Finding
o3: 60%; GPT-4o: 32%
o3 was the top-performing model, scoring 60% overall on HealthBench versus 32% for GPT-4o, showing rapid improvement in health-related open-ended model performance.
Strengths
Large open-source benchmark with 5,000 conversations and 48,562 physician-written rubric criteria. Built with 262 physicians across 26 specialties and 60 countries. Includes open-ended multi-turn tasks, physician response baselines, HealthBench Consensus, HealthBench Hard, and meta-evaluation showing GPT-4.1 grader agreement comparable to physician agreement.
Limitations
Most conversations are synthetically generated rather than drawn from real patient encounters or EHR data. The benchmark is OpenAI-authored and uses an OpenAI model as grader. Example-specific criteria are usually written by individual physicians and are not all independently validated.
MedHELM
2025 · Stanford · Microsoft
o3-mini, Claude 3.7, Gemini 2.0, DeepSeek R1 A B B B C C Responses to 35 medical benchmarks, 5 categories Win-rate, macro-average, by-category; public leaderboard C
Task Design A
Task design is the suite's strongest dimension. The 121-task taxonomy was built with 29 clinicians across 14 specialties and validated at 96.7% category agreement, and each benchmark cleanly specifies inputs, prompts, metrics, and gold standards with metric choices matched to task type. Coverage is the weakest sub-dimension. The "complete taxonomy coverage" claim holds at the subcategory level by count, but 15 of 22 subcategories are represented by a single benchmark, which limits the inferential weight any one subcategory result can carry.
Data, Labels & Leakage B
Real clinical material is substantial (12 EHR-based new benchmarks, MIMIC-IV, EHRSHOT, MedAlign, N2C2-CT) but mixed with exam and literature sources (MedQA, PubMedQA, HeadQA, MedBullets). Expert involvement is meaningful at the taxonomy and jury validation stages, but per-benchmark annotation procedures are not uniformly documented, and MIMIC-RRS gold standard issues were filtered rather than re-adjudicated. Leakage mitigation is structural (14 of 35 datasets private) but lacks canary checks or training-cutoff analysis. Transparency is strong via open code and full appendix documentation. Scale supports category-level claims, but 15 of 22 subcategories rest on a single benchmark.
Model-Use Fidelity B
Exact version snapshots, temperatures, context windows, and pricing dates are all reported, and the nine-model comparator set covers the major labs with both reasoning and non-reasoning architectures. Two structural limitations hold it short. The harness is zero-shot and tool-free across all benchmarks, which preserves cross-model fairness but understates achievable performance on tasks where calculators, retrieval, or execution feedback would be standard in deployment. Output parsing, retry handling, and evaluation date are also under-specified relative to what full reproducibility would require.
Scoring Rigor B
Scoring rigor is appropriate but capped at 2 across all four sub-dimensions by coverage rather than design. The LLM jury is properly constructed, clinically grounded at the rubric level, and validated against clinicians with appropriate statistics (ICC, z-normalization, confidence intervals). The principal weakness is that jury validation covers only 2 of 13 open-ended benchmarks, and three of the nine evaluated models also serve as jurors, which is a same-provider judging concern the paper does not fully address. Stratification by benchmark, category, and architecture is clear, but failure-mode characterization is descriptive rather than systematic, and the MDE analysis is the only uncertainty quantification on the headline rankings.
Robustness & Generalizability C
Robustness is the suite's weakest dimension. The sensitivity work that is present (gold-standard quality filtering, minimum detectable effects per benchmark) is well-constructed but narrow, and standard probes like prompt variation, repeated runs, and jury composition ablations are absent. Generalization is reported by task and category but not by the clinical population axes that matter for deployment, including demographics, sites, specialties, and languages. Edge-case content appears as discrete benchmarks (MedHallu, RaceBias, Medec) rather than as systematic stress testing across the suite.
Clinical Validity & Safety C
Clinical validity is split. Deployment honesty is a clear strength, with explicit limitations, no superiority claims, and a careful framing of the work as a benchmark suite rather than deployment evidence. The clinical anchor is the weak point with no head-to-head clinician comparator on any of the 35 benchmarks, and gold standards drawn from heterogeneous sources with acknowledged quality issues. Safety evaluation is partial, with bias, hallucination, error detection, and privacy covered as discrete benchmarks but no systematic assessment of harmful omissions, escalation failures, or uncertainty miscalibration across the suite.
Bottom Line
Key Finding
DeepSeek R1: 66% win-rate
DeepSeek R1 led with a 66% win-rate and 0.75 macro-average, edging o3-mini (64%, 0.77 macro-average). Claude 3.5 Sonnet matched ~63% win-rate at roughly 40% lower estimated cost.
Strengths
Clinician-validated taxonomy with 29 reviewers across 14 specialties, 96.7% category agreement. Real EHR data in 12 of 13 new benchmarks. LLM jury beat ROUGE-L (0.36) and BERTScore (0.44) and edged clinician-clinician ICC (0.47 vs 0.43). Public leaderboard, open codebase, full taxonomy coverage.
Limitations
LLM jury validated on only 2 of 13 open-ended benchmarks. 15 of 22 subcategories rest on a single benchmark, limiting subcategory conclusions. Three jurors overlap with tested models (partial same-provider judging). Rubrics applied at benchmark level rather than instance level. Administration & Workflow scored worst across all models, root cause unexplored.
2026 · Lehigh · Harvard · Imperial
GPT-5.2, Claude 3.7, Gemini 3 Pro, Grok 4.1 B B B B C C Free-text answers to live patient Q&A threads 0–1 per-case; pre/post-cutoff & RAG ablations C
Task Design B
Task design is reasonable for an open ended medical Q&A benchmark. Models read a structured patient narrative and query and produce a free text response scored against case specific bipolar rubrics across accuracy, completeness, communication quality, context awareness, and safety. The scoring formula is explicit and aligns with the stated behavioral themes. Realism is limited by single turn text only interaction with narratives reconstructed by LLM agents, and no access to imaging, labs, vitals, or EHR. Coverage spans 38 specialties and two languages, but the long tail leaves several specialties with fewer than 15 cases, and the abstract framing of high stakes clinical settings overreaches given the underlying data are online Q&A and telemedicine threads.
Data, Labels & Leakage B
Leakage control is the headline strength, with weekly harvesting of cases posted after January 2023, frozen snapshot versioning, and explicit pre- vs post-cutoff stratification, under which 84% of models degrade on post-cutoff cases. Data realism is moderate since cases are real patient queries answered by verified physicians but are restructured by an LLM curation pipeline and skewed toward online forum demographics. Reference advice is clinically meaningful, but rubric criteria are generated by Qwen3-4B and validated by only two physicians on 50 cases (292 criteria) without formal adjudication. Transparency is strong with named sources, prompts, and released code, while the 2,756 cases and 16,702 criteria support overall claims but leave several specialties (CT Surg, Peds Surg, Pathology) below 15 cases.
Model-Use Fidelity B
Model identifiers and versions are listed in Table 7 with temperature set to zero, but evaluation dates, access mode, system prompts for evaluated models, and reasoning-effort settings for the GPT-5 family are not reported, which is the main fidelity gap. Zero-shot prompting is uniform but the exact prompt template sent to models is not shown. Tool and RAG handling is fair, since all 38 models run without tools matching the chatbot Q&A deployment mode, and an explicit open-book vs closed-book RAG ablation is run on a January 2026 slice. The comparator set is comprehensive across proprietary, open-source, and medical-specific families with a HealthBench cross-benchmark.
Scoring Rigor B
GPT-4.1 grader paired with case-specific bipolar weighted rubrics on five clinical axes is reasonable, but the criteria are Qwen3-4B generated rather than physician-written and validation rests on 50 cases. Reliability evidence is good (Gwet's AC1 of 0.89 on criteria, Macro F1 of 0.76 against a 0.89 human ceiling, Pearson 0.54 vs 0.26 for LLM-as-Judge). Stochastic stability is a weak point, with single-run point estimates at temperature zero across the 38-model leaderboard and no confidence intervals or repeated runs. Interpretability is strong, with stratification across specialties, themes, axes, and pre vs post-cutoff slices, plus a seven-category error taxonomy on the bottom 100 cases per model, though severity weighting and ranking uncertainty are absent.
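Gwet's AC1, cited above for criteria-level agreement, is a chance-corrected agreement coefficient that is less sensitive than Cohen's kappa to skewed label prevalence. The sketch below computes it for two raters and binary labels; the function name and the toy ratings are illustrative assumptions, not data from the paper.

```python
def gwet_ac1_binary(rater_a, rater_b):
    """Gwet's AC1 for two raters assigning binary (0/1) labels."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(rater_a)
    # Observed proportion of items where both raters agree.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Prevalence of the positive label, averaged over both raters.
    prevalence = (sum(rater_a) + sum(rater_b)) / (2 * n)
    # Chance agreement under Gwet's model for two categories.
    chance = 2 * prevalence * (1 - prevalence)
    return (observed - chance) / (1 - chance)


# Hypothetical judgments on ten rubric criteria (1 = criterion is valid).
physician_1 = [1, 1, 1, 1, 0, 1, 1, 1, 1, 1]
physician_2 = [1, 1, 1, 1, 0, 1, 1, 1, 0, 1]
ac1 = gwet_ac1_binary(physician_1, physician_2)
```

With labels this skewed toward "valid", AC1 stays close to the raw agreement rate, which is one reason it is often preferred when nearly all criteria are expected to pass review.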
Robustness & Generalizability C
Robustness is partially addressed through subgroup stratification across specialties, themes, and axes, a closed vs open-book RAG ablation, and pre vs post-cutoff splits, but prompt sensitivity, repeated runs, and independent judge cross-checking are absent. Generalizability is reported across 38 specialties and five themes but is not broken out separately by language or source platform despite the bilingual multi-platform claim. Edge case work is limited to a Jaccard similarity analysis of bottom 100 failure cases across models (mean 0.24), which confirms failures are model-specific rather than data driven but provides no red-teaming, adversarial subset, or high-risk scenario stratification.
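The Jaccard analysis described above asks how much the sets of failed cases overlap between models. A minimal sketch of a mean pairwise Jaccard computation follows; the model names and case IDs are invented for illustration, and the paper's exact aggregation may differ.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two collections, treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def mean_pairwise_jaccard(failures_by_model):
    """Average Jaccard similarity over every pair of models."""
    pairs = list(combinations(failures_by_model.values(), 2))
    if not pairs:
        raise ValueError("need at least two models")
    return sum(jaccard(x, y) for x, y in pairs) / len(pairs)


# Hypothetical worst-case IDs for three models.
failures = {
    "model_a": {1, 2, 3, 4},
    "model_b": {3, 4, 5, 6},
    "model_c": {7, 8, 9, 1},
}
overlap = mean_pairwise_jaccard(failures)
```

A low mean overlap, as in this toy example, suggests that models fail on different cases, while a value near 1 would point toward a shared set of problematic items in the data.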
Clinical Validity & Safety C
Clinical anchor rests on verified physician advice from the source platforms, RAG-validated curation against guidelines, and 2-physician QA on 50 cases, but there is no head-to-head physician comparator on the same queries. Safety enters the rubric via a dedicated Safety axis, an Emergency Referrals theme, and negative-weighted contraindication criteria, though Safety covers only 6.9% of criteria and subgroup safety and uncertainty calibration are not analyzed. The research-only disclaimer in the ethics section is honest, but the abstract's "high-stakes clinical settings" framing overreaches given the online forum source data.
Bottom Line
Key Finding
Top model (GPT-5.2): 39.2%
Even the top model, GPT-5.2, scores only 39.2%, and 84% of the 38 models tested degrade on post-cutoff cases, pointing to widespread contamination on static medical benchmarks.
Strengths
Continuously updated live benchmark with weekly case harvesting and versioned frozen snapshots for reproducibility. 2,756 real patient queries answered by verified physicians across four professional platforms, 38 specialties, and two languages (English and Chinese). Case-specific bipolar rubrics with positive and negative weighted criteria (16,702 total). Multi-agent curation with RAG evidence validation against guidelines. Strong human validation, with Gwet's AC1 of 0.89 on rubric criteria and grader Macro F1 of 0.76 against a 0.89 human ceiling. 38 LLMs tested with explicit post-cutoff and RAG ablations.
Limitations
Cases come from online Q&A and telemedicine forums rather than hospital EHR encounters, so deployment evidence is indirect. Text only, with no imaging or multimodal inputs. GPT-4.1 grader paired with GPT-5.x topping the leaderboard creates a soft same-provider judging concern. Single-temperature runs with no prompt sensitivity, decoding variance, or repeated-run analysis. Human validation rests on 50 cases and two physicians. No clinician comparator or outcome data.
2026 · Stanford · MIT · Binghamton
GPT-4.1, Claude 3.5 v2, Gemini 2.0, DeepSeek-V3 C B B B C B Agentic EHR actions & answers via FHIR tools Pass@1 by difficulty/task; with vs. without memory C−
Task Design C
The benchmark runs 600 agentic tasks (300 from v1, 300 new) against a virtual FHIR EHR, covering lab checks, conditional med ordering, follow-up scheduling, and duplicate-order detection. Coverage is real but uneven: the 300 new tasks come from just two patients, were curated by one physician, and the v2 results test only GPT-4.1. The authors themselves note that instructions are more explicit than how clinicians actually talk, and that the evaluator marks clinically correct answers wrong when the format is off. The "memory transfers across tasks" framing also leans hard on a single worked example.
Data, Labels & Leakage B
The benchmark runs in a virtual FHIR EHR with no description of where the underlying patient data comes from, and the 300 new tasks were seeded from only two patients with the variations created by swapping patient IDs. One physician (KB) curated which tasks made the cut, but the actual reference answers were coded by a co-author and never independently validated. The v1 task set is public on GitHub and the paper says nothing about contamination, training cutoffs, or memorization, which is the main grade-limiting issue here. However, transparency is genuinely good. Prompts, procedures, and code are all available enough for someone else to rerun the pipeline.
Model-Use Fidelity B
The system prompt, memory augmentation prompt, and tool suite are all reproduced in the appendices, which is useful. However, GPT-4.1 is named without a dated version, evaluation date, decoding settings, or access mode. Only one model is tested in v2, and the headline gain from 69.67% to 98% mixes a model change (Claude 3.5 Sonnet v2 to GPT-4.1) with the new agent design, though Fig. 5 does show the proper within-model comparison. There is no clinician baseline and no comparison against another agent framework, so the claim that EHR agents are within reach outruns what one model on one benchmark family can support.
Scoring Rigor B
Scoring is programmatic rather than LLM-as-judge, which removes one common source of bias, but the eval functions were written by a co-author and never physician-validated, and the authors themselves show cases where clinically correct outputs were marked wrong on formatting alone. There is no reliability evidence of any kind: no inter-rater agreement, no human-vs-evaluator agreement, no calibration. Every reported number is a single-run point estimate with no confidence intervals, no repeated runs, and no decoding controls, so the headline jumps (69.67% to 98%, 100% on Tasks 9 and 10) cannot be assessed for stability. The error analysis in Sec. 3.4 is the strongest piece, with three named failure modes, worked reasoning traces, and clinical interpretation, but it is illustrative rather than systematic across all 34 misses.
Robustness & Generalizability C
The memory ablation and the 300-task held-out set are the two real strengths. They isolate the memory contribution (91% to 98%) and show the agent design holds on unseen tasks (88.67%). Beyond that the testing thins out: a single fixed prompt, no repeated runs, no variance estimates, and no decoding controls, so prompt sensitivity and stochastic stability are unknown. The "generalization" claim is also narrower than it sounds, since the new tasks come from just two patients with MRN swaps, all in the same virtual EHR, on a single model. Edge cases are discussed thoughtfully when they show up in the failure analysis but are never deliberately constructed or stress-tested.
Clinical Validity & Safety B
The clinical anchor is thin. One internal medicine physician curated task quality but did not write the reference answers, run the tasks, or adjudicate agent outputs, and no clinician baseline exists on these tasks. Safety is acknowledged in prose throughout, with the authors themselves noting that clinical use needs above 95% accuracy and pointing to specific harmful failure modes, but none of this is in the metric. Pass@1 weights a wrong dose and a formatting slip the same. The deployment discussion is the strongest piece here, calling for memory auditing, versioning, sandboxed EHR interactions, drift monitoring, and regulatory alignment, and the paper avoids any superiority or deployment-readiness overclaim.
Bottom Line
Key Finding
GPT-4.1: 98% with memory
GPT-4.1 hit 98.0% with memory and 91.0% without, up from v1's best of 69.67% (Claude 3.5 Sonnet v2). Hard tasks (≥3 tool calls) jumped from 63.33% to 96.67%, with Tasks 9 and 10 reaching 100%.
Strengths
Targets v1's failure modes directly: structured FHIR tools replace raw HTTP calls, a sandboxed calculator handles arithmetic, and a "finish" tool enforces output typing. The memory mechanism shows real transfer: a note written for Task 10's HbA1C edge case pulled Task 9 from 56.7% to 100% even though Task 9 had no memory entry. Adds 300 new tasks across 10 clinical categories (imaging surveillance, medication validation, device monitoring, immunization, and others), with blinding between the task author, physician curator, and evaluator. Unlike v1, it also includes a time-cost table reporting per-task tokens (6.3k to 32.4k), latency (7.8 to 20.6 seconds), and cost ($0.01 to $0.07).
Limitations
Only one model is tested (GPT-4.1), so there's no cross-lab comparison and no evidence the design generalizes. The agent sees up to 200 items per search while the evaluator checks against 5,000, which the authors admit produces false failures unrelated to reasoning. The 300 new tasks come from two patients' records, generated by another OpenAI model (o3) and curated by a single physician. Several failures are evaluator rigidity, not clinical errors: one task was marked wrong for returning a single value instead of a pair. Memory is appended to the system prompt with no retrieval, versioning, or audit trail, which the authors themselves flag as a deployment concern. The whole evaluation runs in a virtual EHR with no clinician comparator and no patient outcomes.
BRIDGE
2025 · Brigham/Harvard · Mayo · Stanford · MIT
GPT-4o, Gemini 2.5 Flash, DeepSeek-R1, Llama 4 (95 LLMs) B B C B B C Responses to 87 clinical-text tasks, 9 languages Per-task metric → 0–100; by language/specialty C
Task Design B
BRIDGE covers 87 tasks across 8 task types from 59 datasets in 9 languages and 14 specialties, with 78.2% sourced from real EHR notes or clinical case reports. Inputs are templated by clinical field, output formats are standardized per task type, and primary metrics are matched to task category. The construct is clinical text understanding rather than end-to-end clinical workflow, and the authors note that benchmark scores do not fully equate to LLM performance on specific clinical applications. Coverage outside English is notably lacking, with French, Norwegian, and Portuguese each represented by only 3 tasks.
Data, Labels & Leakage B
78.2% of the 87 tasks come from real EHR notes or case reports and 21.8% from online patient-doctor consultations, drawn from 59 named source datasets across nine languages. Reference labels are inherited from those source datasets without re-annotation, which the authors flag as a limitation, and per-dataset annotation procedures are not uniformly documented. Contamination is checked through a 5-gram token completion analysis at five truncation positions, finding most tasks did not appear in training corpora, but there is no canary string, no temporal separation, and no rolling release. Transparency is strong, with public code, an open data subset on HuggingFace, prompts in supplementary materials, and a continuously updated public leaderboard. The 138,472 test samples easily support overall claims, but several languages and specialties rest on three or fewer tasks.
Model-Use Fidelity C
95 LLMs are evaluated with specific version identifiers such as GPT-4o-0806, greedy decoding at temperature 0, and a fixed random seed, with proprietary models run on Azure and Google Cloud under HIPAA compliance. Evaluation dates are not reported and Gemini snapshots are not specified beyond the family name, which triggers the rubric's cap on the Model-Use Fidelity (MF) grade. Three inference strategies are applied uniformly across all models, with templated I/O and documented output handling, though invalid classification responses are filled with random labels rather than retried. No tools or RAG are provided to any model, preserving cross-model fairness but understating achievable performance on coding and QA tasks. The comparator set is broad across proprietary, open-source, and medical-specific families, though o3, Gemini-2.5-Pro, and Med-PaLM 2 were excluded due to HIPAA access constraints.
Scoring Rigor B
BRIDGE scores with standard automatic metrics rather than an LLM judge, using accuracy and F1 against expert reference labels for classification, NER, and coding, and ROUGE for QA and summarization. The classification and extraction metrics are well-validated and free of judge bias, but ROUGE for clinical generation is not validated against clinical correctness, and no formal reliability or calibration evidence is reported. Statistical analysis is adequate, with 1,000-iteration bootstrap confidence intervals and pairwise significance tests. Stratification is rich across task type, language, specialty, clinical stage, and inference strategy, but there is no systematic error taxonomy or example failure cases.
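The bootstrap intervals mentioned above can be illustrated with a minimal percentile-bootstrap sketch. It assumes an accuracy-type metric with one 0/1 correctness flag per test sample; the function name, the seed, and the choice to resample individual samples are illustrative assumptions, not BRIDGE's actual code.

```python
import random

def bootstrap_ci(correct, n_iter=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for accuracy.

    `correct` is a list of 0/1 flags, one per test sample (an assumed
    input format). Returns (point_estimate, lower, upper).
    """
    rng = random.Random(seed)
    n = len(correct)
    resampled_means = []
    for _ in range(n_iter):
        # Draw n samples with replacement and recompute accuracy.
        sample = [correct[rng.randrange(n)] for _ in range(n)]
        resampled_means.append(sum(sample) / n)
    resampled_means.sort()
    # Take the empirical alpha/2 and 1 - alpha/2 percentiles.
    lower = resampled_means[int((alpha / 2) * n_iter)]
    upper = resampled_means[int((1 - alpha / 2) * n_iter) - 1]
    return sum(correct) / n, lower, upper

# Hypothetical task with 100 samples, 70 answered correctly.
point, lower, upper = bootstrap_ci([1] * 70 + [0] * 30)
print(f"accuracy={point:.3f}, 95% CI=({lower:.3f}, {upper:.3f})")
```

An interval built this way reflects sampling variability in the test set only; it says nothing about run-to-run model variance, which matters here because each model is run once.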
Robustness & Generalizability B
The main robustness work in BRIDGE is the three-way comparison of zero-shot, chain-of-thought, and five-shot prompting across all 95 models, alongside comparisons between model generations within the same family and a check for whether benchmark text appeared in training data. Each task uses a single prompt template, so prompt-wording sensitivity is not measured, and models are run only once. Performance is broken out separately by language, specialty, task type, and clinical stage, and the parallel English and Russian versions of MedNLI give a clean cross-lingual comparison, but several languages and specialties rest on three tasks or fewer. There is no red-teaming, no stratification by high-risk clinical scenarios, and no analysis of shortcuts the models might be exploiting.
Clinical Validity & Safety C
BRIDGE relies on the expert-labeled answers from its 59 source datasets rather than direct comparison against clinicians on the same tasks. Safety appears in the discussion through general references to LLM overoptimism and weaker performance in non-English clinical text, but the paper does not analyze harmful recommendations, omissions, or failure to escalate. Deployment framing is a strength, with the authors stating that benchmark scores do not equate to performance on specific clinical applications and positioning BRIDGE as a starting point for model selection rather than evidence of readiness for clinical use.
Bottom Line
Key Finding
≤55.5/100 on real clinical text
DeepSeek-R1 scored 92 on USMLE but only 44.2 out of 100 on BRIDGE under zero-shot, and no model exceeded 55.5, the top score, reached by Gemini-1.5-Pro with few-shot prompting. The gap shows how far exam-style benchmark scores can overstate ability on EHR-based clinical text tasks.
Strengths
Largest multilingual real-world clinical text benchmark to date, with 95 LLMs evaluated across 87 tasks in nine languages, totaling 24,795 experiments and 39.5 million inferences. 78.2% of tasks sourced from real EHR notes or clinical case reports across 14 specialties, with reference labels from expert annotation or structured EHR derivation. Fully automatic scoring with standard metrics avoids LLM-as-judge concerns, supported by 1,000-iteration bootstrap CIs, fixed random seed, and greedy decoding for reproducibility. Includes 5-gram token-completion contamination analysis and a continuously updated public leaderboard with open code and an open dataset subset.
Limitations
Reference labels inherited from original source datasets without clinician re-annotation, which the authors acknowledge. No head-to-head clinician comparator on any task. Models scored only against dataset gold labels. Generation tasks (QA, summarization) evaluated solely with n-gram and embedding metrics (BLEU, ROUGE, BERTScore) rather than clinical accuracy or safety review. Overall score averages across heterogeneous primary metrics, and no explicit evaluation date is reported.
2026 · BIDMC/Harvard · Stanford
o1, o1-preview, GPT-4o vs. physicians A B C A C B Differentials, test choice, reasoning, management Bond / R-IDEA scores vs. hundreds of physicians C+
Task Design A
The six experiments map onto real clinical workflows: differential diagnosis, test selection, reasoning documentation, management decisions, probability estimation, and second opinions at three emergency department touchpoints. Scoring uses validated instruments throughout, including the Bond score, R-IDEA, and physician-consensus rubrics, with inter-rater reliability reported across dual-physician scoring. Two experiments rest on only five or six cases, and the paper's framing claim that LLMs have eclipsed most benchmarks of clinical reasoning reaches beyond what those smaller experiments alone can support.
Data, Labels & Leakage B
Only the 76-case emergency department experiment uses raw EHR data. The other five use published case reports, virtual curriculum cases, or vignettes curated for medical education, which the authors note may overstate performance on messier real-world text. Scoring is a clear strength, with dual-physician adjudication, inter-rater reliability reported across every experiment, and validated instruments like R-IDEA and the Bond score. Contamination protection is uneven: the landmark cases were never publicly released and the CPCs were tested with a pretraining-cutoff split, but the other case sources are widely public and not separately checked. Code and rubrics are on Zenodo, though raw data access is restricted by patient privacy and NEJM licensing.
Model-Use Fidelity C
The paper evaluates o1-preview alongside GPT-4, GPT-4o, and o1, but does not report dated API snapshots, evaluation dates, or decoding parameters in the main text, which triggers the rubric's cap on the Model-Use Fidelity (MF) grade. Output handling is rigorous with validated scoring instruments and dual-physician adjudication, though the number of responses generated per case varies across experiments without justification. The evaluation is text-only and runs the LLMs without tools or RAG, while physician comparators in two experiments had access to conventional medical resources. The comparator design is the paper's strongest fidelity element, with hundreds of practicing physicians across multiple training levels and blinded raters in the ER study whose blinding success was quantitatively verified.
Scoring Rigor A
The paper relies on validated clinical scoring instruments throughout, including the Bond score for differentials, R-IDEA for clinical reasoning documentation, and physician consensus rubrics for management cases. Two attending physicians scored every experiment and reported rater agreement each time, and the ER study also verified blinding success by checking whether raters could distinguish AI from human responses. Statistical handling is solid, including 95% confidence intervals, McNemar's tests, and regression models that account for repeated measures. Results are broken out by diagnostic touchpoint and tested for pretraining contamination, though there is no systematic categorization of failure modes across experiments.
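For readers unfamiliar with McNemar's test, the sketch below shows the exact (binomial) form for paired binary outcomes, such as whether the model and a physician each reached the correct diagnosis on the same case. Whether the paper used the exact or the chi-square variant is not stated here, and the function name and example counts are hypothetical.

```python
from math import comb

def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value.

    b: cases where only the first rater/system was correct.
    c: cases where only the second rater/system was correct.
    Concordant pairs (both right or both wrong) do not enter the test.
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, no evidence of a difference
    k = min(b, c)
    # Under the null, discordant pairs split 50/50, so the smaller
    # count follows Binomial(n, 0.5); double the lower tail.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Hypothetical counts: 9 cases only the model got right, 1 only the physician.
print(mcnemar_exact(1, 9))
```

The test uses only the discordant pairs, which is why paired designs on small case sets (five or six cases in two of the experiments) have little power to detect differences.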
Robustness & Generalizability C
The paper checks for training-data contamination on the NEJM CPCs by splitting cases around the o1-preview pretraining cutoff and finds no significant difference. It also uses six unreleased landmark cases specifically to defend against memorization. Beyond these, there is no prompt sensitivity testing, no decoding ablation, and no repeated runs for most experiments. Generalization rests on running six different task types and stratifying ER cases by triage stage and training level, but all real-world data comes from a single academic center and there is no demographic or specialty breakdown. Edge case work centers on cannot-miss diagnoses in the Healer cases, where the model did not significantly outperform physicians, but no systematic adversarial or shortcut testing is reported.
Clinical Validity & Safety B
Every experiment includes a direct physician baseline matched on specialty and training level, the ER study uses blinded attending raters with verified blinding success, and comparators range from individual attendings to a 553-physician national sample. Safety evaluation is lighter, limited to cannot-miss diagnosis tracking on the Healer cases and an unhelpful test category in CPC scoring, with no systematic harm taxonomy or uncertainty calibration. Deployment framing is honest, with the authors explicitly calling the ER experiment a proof of concept, emphasizing the text-only constraint, and calling for prospective trials rather than claiming readiness for clinical use.
Bottom Line
Key Finding
o1: 67% exact-dx in ER
o1-preview outperformed hundreds of physician baselines and prior LLMs across all six experiments, including a 41-percentage-point gap over physicians with GPT-4 access on Grey Matters management cases and a perfect R-IDEA score on 78 of 80 NEJM Healer cases. In a blinded real-world emergency department study at Beth Israel Deaconess, o1 produced exact or very close diagnoses on 67% of cases at initial triage, surpassing two attending physicians, with the largest performance gap occurring when patient information was most limited.
Strengths
Direct head-to-head comparison against hundreds of practicing physicians, including attending physicians and residents, across multiple validated clinical reasoning tasks rather than against dataset gold labels alone. Real ER cases from a major academic medical center scored under blinded conditions, with quantitative verification that raters could not tell human and AI responses apart. Validated scoring instruments throughout (R-IDEA, Bond score). Robust statistical handling including mixed-effects models, McNemar's test, and inter-rater reliability for all dual-scored tasks. Peer-reviewed publication in Science with analysis code and rubrics available on Zenodo.
Limitations
The study evaluates only six reasoning tasks centered on internal medicine and emergency medicine, with no surgical, pediatric, or specialty-specific decision-making. The ER experiment is framed by the authors as a proof of concept since real emergency workflow centers on triage and disposition rather than diagnostic accuracy alone. The evaluation is text-only, excluding the auditory, visual, and physical-exam information clinicians routinely use. The headline model, o1-preview, has since been superseded by o3, and the paper does not report whether performance holds on newer models.
PrIME-LLM
2026 · Mass General Brigham · Harvard
GPT-5, Claude 4.5 Opus, Gemini 3.0 Pro, Grok 4 B C B B D C Sequential select-all answers across 5 case domains PrIME-LLM radar score (0–1); per-domain accuracy D
Task Design B
The benchmark presents 29 MSD Manual vignettes as a sequential workflow from differential diagnosis through management, capturing longitudinal reasoning that single-question benchmarks miss. Inputs, scoring, and the PrIME-LLM metric are clearly specified, but responses are select-all-that-apply choices scored all-or-nothing against a fixed key. Coverage fits the construct, though the conclusion that models cannot support unsupervised patient-facing care reaches beyond 29 teaching cases with no clinician comparator.
Data, Labels & Leakage C
The reference standard rests on expert-authored, peer-reviewed MSD answer keys, but model outputs were scored by medical students with no adjudication or inter-rater reliability reported. The vignettes are public and static, and the authors concede that pretraining contamination cannot be excluded, with no canary checks, held-out set, or cutoff analysis to mitigate it. Twenty-nine teaching cases support the core comparisons but are modest for the breadth of the claims, and the leakage exposure caps this domain at C.
Model-Use Fidelity B
All 21 models are named and dated, but exact version snapshots, decoding settings, and verbatim prompts are not reported, and some models were run through web interfaces rather than the API. Context was preserved across the longitudinal vignette, though reasoning modes and tools were deliberately disabled, which likely understates capability for reasoning-optimized models. The cross-model comparator set is appropriate and current, and the absence of a clinician baseline is explicitly bounded.
Scoring Rigor B
Scoring used a deterministic rubric anchored to expert-authored MSD answer keys, but the scorers were medical students and no inter-rater agreement statistic is reported, with single-rater scoring per response. Uncertainty handling is a clear strength, combining triplicate runs, standard errors, ANOVA with Tukey HSD, regression, confidence intervals, and effect sizes. Results are stratified across several clinically meaningful axes and failure is interpreted by domain, though example-level error analysis with severity is absent.
Robustness & Generalizability D
Robustness comes from triplicate runs, broad subgroup analysis across age, sex, modality, reasoning, and family, and a demographic regression, but the select-all format is not stress-tested for prompt sensitivity, option order, or artifacts. Generalizability is shown directly for demographics and the five domains but not for specialties, sites, languages, or settings, and the 29-case strata are underpowered. No edge-case or adversarial analysis is reported.
Clinical Validity & Safety C
The clinical anchor is the expert-authored MSD answer key, with no clinician comparator, patient outcomes, or prospective evidence, which the authors acknowledge. Safety is addressed through the differential-diagnosis weakness and subgroup failure rates but without harm-severity, escalation, or contraindication analysis. Implementation framing is honest, with clear limitations and a clinician-supervised role, though the conclusion that models are unsafe for unsupervised deployment reaches beyond 29 teaching vignettes scored without a human comparator.
Bottom Line
Key Finding
PrIME score: 0.64–0.78
PrIME-LLM scores ranged from 0.64 (Gemini 1.5 Flash) to 0.78 (Grok 4), with reasoning-optimized models averaging 0.76 against 0.67 for nonreasoning models. Differential diagnosis was the weakest domain across nearly all models, with failure rates often above 0.80, while final diagnosis was the most reliable, with failure rates below 0.40. Because raw accuracy clustered narrowly between 0.81 and 0.90, the PrIME-LLM metric separated models that conventional accuracy made look similar.
Strengths
The evaluation covers 21 models from five developers across 29 vignettes, totaling 16,254 scored responses, with each vignette run in triplicate. Cases are presented sequentially with preserved context, so models are tested across a full clinical arc from differential diagnosis through management rather than on isolated multiple-choice items. All models were run under matched settings with reasoning and web access disabled, which supports fair cross-vendor comparison. Scoring used an all-correct, no-incorrect rubric against expert-authored, peer-reviewed reference cases, and replicates were scored independently. The PrIME-LLM metric rewards balanced performance and produced wider separation between models than raw accuracy.
Limitations
The vignettes are publicly available and static, so the authors acknowledge that exposure during pretraining cannot be excluded, and the study includes no temporal-separation or leakage control. Reasoning modes were turned off and no augmentation (retrieval, guidelines, calculators, or tools) was used, so results describe baseline rather than maximal performance, and models were accessed through a mix of API and web interfaces. Each response was scored by a single evaluator, and no inter-rater reliability statistic is reported. No clinician comparator was included, and the authors state the study was not designed to test model-versus-human equivalence.
2026 · Stanford · Google DeepMind
COI: Developer-published
AMIE (DeepMind) + 9 cardiologists A B B B D A Cardiologist case assessments, with vs. without AI Subspecialist A/B preference; error & omission rates C+
Conflict of Interest
The evaluated system, AMIE, is a Google product, the study was funded by Alphabet, and many authors are Alphabet employees who may hold company stock. This is a developer evaluating its own model, so the favorable results carry a clear conflict. Several Stanford authors also report unrelated industry ties. Blinding of the subspecialist evaluators and open data partly offset, but do not remove, this concern.
Task Design A
The study tests a realistic workflow: general cardiologists assessing real patients with suspected genetic cardiomyopathy, with and without AI help, across triage, diagnosis, and management. Inputs, forms, and the blinded preference rubric are clearly specified and well matched to the question being asked. Coverage is appropriately narrow and deep, though the framing occasionally reaches toward broader cardiac care than one single-center study supports.
Data, Labels & Leakage B
The main strength is real, well-documented patient data from a leading subspecialty center, with the full dataset openly released and the trial registered. Evaluation drew on qualified subspecialists, though each case had a single rater with no adjudication or formal ground-truth diagnosis. Private institutional data limits contamination, but no explicit cutoff check is reported, and the single-center sample is small for broad claims.
Model-Use Fidelity B
AMIE was run with a multistep procedure combining web search and self-critique, and the assisted clinicians could question it through a live chat, which is close to how such a tool would actually be used in practice. Reporting is thinner, with the base model named but no version snapshot, decoding settings, or run date. The main fairness gap, which the authors flag, is that AMIE read only text reports while cardiologists also saw the raw images.
Scoring Rigor B
Scoring is generally strong: blinded subspecialists applied a clinically grounded rubric, and the error analysis gives concrete, severity-rated examples of hallucinations and omissions rather than just aggregate numbers. The statistics fit the paired design, with appropriate tests and bootstrapped intervals. The clear weakness is reliability: each case was scored by a single rater, with no inter-rater agreement or calibration evidence reported.
Robustness & Generalizability D
Randomization, blinding, and bootstrapped intervals give the core comparison solid protection against bias, and the subspecialists checked for demographic applicability. However, all data come from a single English-speaking center, results are not stratified by subgroup, and the authors concede they cannot test generalization across sites or populations. There are no repeated model runs and no designed stress or artifact testing.
Clinical Validity & Safety A
The clinician comparator is a strength, with general cardiologists managing the same real patients unassisted, randomized and blindly judged against the assisted arm. Safety is measured directly through errors, omissions, and a quantified hallucination rate, though subgroup and escalation coverage is less robust. The deployment framing is honest and well bounded, stating clearly that the system is not ready for autonomous clinical use.
Bottom Line
Key Finding
AMIE-assisted preferred: 46.7%
Subspecialists preferred AMIE-assisted assessments in 46.7% of cases versus 32.7% for cardiologists alone, with the rest tied, and assisted assessments were favored for management and diagnostic testing. Assisted responses had fewer clinically significant errors (13.1% versus 24.3%) and less missing content (17.8% versus 37.4%). Cardiologists reported the system helped in 57% of cases and saved time in about half of them.
Strengths
This is a registered, CONSORT-compliant, blinded randomized trial on 107 consecutive real patients using genuine multimodal cardiac data, a step above vignette-based testing. Subspecialist evaluators were blinded to which assessment used AI, preference and error judgments used appropriate paired statistics with bootstrapped intervals, and the dataset and ten-domain rubric were released openly. The design isolates the effect of AI assistance on real clinical reasoning.
Limitations
The system read text reports rather than raw images, patients came from one US specialty center in English only, and the outcome was subspecialist preference rather than patient outcomes. Cardiologists knew their assignment, and the evaluators shared an institution with the model's developers. About 6.5% of AMIE responses contained clinically significant hallucinations. The base model, Gemini 2.0 Flash, is now several generations behind the frontier.
2026 · Stanford · Harvard
Custom GPT-4 + 70 physicians A C B A B A Clinician diagnoses under 3 AI-collaboration modes 19-point rubric scores vs. conventional resources B−
Task Design A
The study targets a sharp, well-defined question, namely how workflow sequencing shapes clinician-AI collaboration, and the interactive design with independent assessments, a synthesis step, and open dialog mirrors real teamwork better than a single prompt. Inputs, response structure, and the 19-point rubric are clearly specified. The main limit is that vignettes assume rather than test history-taking and examination, but the conclusions stay appropriately bounded as exploratory.
Data, Labels & Leakage C
The material is six vignette-based cases drawn from real de-identified patients, which gives some grounding but is not representative of real practice. Grading is a strength, with two board-certified physicians, an established rubric, and reconciliation of discrepancies, though correctness beyond the final diagnosis rests on expert judgment. The protocol, prompt, and rubric are well documented, but withholding the cases limits both replication of the core task and clinical coverage.
Model-Use Fidelity B
Prompting is the heart of the study and is handled well, with a published multi-part collaborative system prompt, fresh context for each case, and a structured synthesis-and-dialog workflow that reflects deliberate use. The three-arm design plus an AI-alone benchmark, anchored to the authors' 2024 study on the same vignettes, isolates the effect of the collaborative prompt from any change in the model. The base model is named only as GPT-4, without an exact version, decoding settings, or run date.
Scoring Rigor A
Scoring is the strongest part of the study. Every case was double-graded by blinded board-certified physicians using an established rubric, with high inter-rater agreement (ICC 0.91) and a defined reconciliation rule. The statistics fit the nested design, with a prespecified mixed-effects model, confidence intervals, and a power calculation, and the authors even examine the model's non-determinism. Correctness beyond the final diagnosis still rests on grader judgment.
Robustness & Generalizability B
Randomization, blinded double grading, a prespecified sensitivity analysis, and a verification for each protocol guard the core results, and the authors actively probe failure modes, documenting sycophantic anchoring, run-to-run non-determinism, and the unfair advantage vignettes may give the model. Generalizability is the weak point, with a narrow, AI-literate, almost entirely internal-medicine sample and only six cases.
Clinical Validity & Safety A
The comparator is strong, with the same internal medicine physicians tested on identical vignettes using conventional resources and each AI workflow, randomized and blindly graded, plus an AI-alone benchmark. Safety gets attention, including cases where AI lowered scores, sycophancy, and automation-bias concerns, though harm severity and escalation are not formally measured. The deployment framing is honest, repeatedly bounding the work as exploratory rather than clinical evidence.
Bottom Line
Key Finding
Assisted: 85% vs. 75% alone
Both collaborative workflows raised clinician accuracy over conventional resources, 85% for AI-as-first-opinion and 82% for AI-as-second-opinion versus 75%, with no significant difference between the two workflows. AI alone scored highest at about 90% and was not statistically beaten by either assisted arm. The gains came mainly from lifting the lowest-scoring cases rather than improving every case uniformly.
Strengths
This is a registered, IRB-approved, CONSORT-compliant randomized trial with blinded dual-physician grading and very high inter-rater agreement. It tests a true design question, how workflow sequencing shapes collaboration, rather than just whether AI helps. Direct comparison to the authors' earlier identical-vignette study isolates the effect of the collaborative prompt, and the anchoring and dialog analyses add useful mechanistic insight.
Limitations
The cases are vignettes, not real encounters, so history-taking, examination, and test selection are assumed rather than assessed, and the authors call the work hypothesis-generating. Each clinician completed few cases, the sample skews toward internal medicine and AI-literate volunteers, and self-selection likely inflated the positive attitudes. The system also showed non-determinism, occasional missing syntheses, and anchoring on clinician input.
2026 · Mount Sinai (Icahn)
ChatGPT Health (gpt-5-mini) B B B A B A Forced four-level triage choice + crisis interstitial Mistriage rate by acuity; under- vs. over-triage B−
Task Design B
The study tests ChatGPT Health on its actual job, triage, running 60 clinician-authored vignettes across 21 domains through the real consumer interface and scoring each against a guideline-anchored physician gold standard. Inputs, outputs, and the accuracy metric are all clearly specified, and treating undertriage as the worse error matches the clinical stakes. However, coverage thins at the emergency end, where the headline finding rests on only two scenarios.
Data, Labels & Leakage B
Three physicians independently assigned every gold-standard triage level against 85 cited guidelines, reaching near-perfect agreement (Fleiss' κ = 0.90), and the full set of vignettes, prompts, model responses, and analysis code is posted openly with code that regenerates the reported results. The cases are clinician-authored synthetic vignettes rather than real encounters, with no evidence offered that they match real-world case distributions. Emergency and demographic strata are thinly populated, which the authors acknowledge limits power for those subgroups.
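As a reference point for the agreement statistic quoted above, here is a minimal sketch of Fleiss' kappa for a fixed number of raters per item. The input layout (one list of labels per vignette) and the A to D triage labels are assumptions for illustration; the study's own analysis code is the authoritative source.

```python
from collections import Counter

def fleiss_kappa(ratings):
    """Fleiss' kappa for items rated by the same number of raters.

    `ratings` is a list of per-item label lists, e.g. three physicians'
    triage levels for each vignette (assumed layout). Undefined (division
    by zero) if every rating falls in a single category.
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if any(len(item) != n_raters for item in ratings):
        raise ValueError("every item needs the same number of ratings")
    counts = [Counter(item) for item in ratings]
    categories = set().union(*counts)
    # Observed agreement: share of agreeing rater pairs, averaged over items.
    per_item = [
        (sum(c * c for c in cnt.values()) - n_raters) / (n_raters * (n_raters - 1))
        for cnt in counts
    ]
    p_observed = sum(per_item) / n_items
    # Chance agreement from the pooled category proportions.
    proportions = [
        sum(cnt[cat] for cnt in counts) / (n_items * n_raters) for cat in categories
    ]
    p_expected = sum(p * p for p in proportions)
    return (p_observed - p_expected) / (1 - p_expected)

# Hypothetical labels for five vignettes, three raters each.
example = [["A"] * 3, ["B"] * 3, ["C"] * 3, ["D"] * 3, ["A", "B", "A"]]
print(round(fleiss_kappa(example), 3))
```

Kappa corrects raw agreement for the agreement expected by chance given how often each triage level is used, so a value near 0.90 indicates agreement well beyond what label frequencies alone would produce.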
Model-Use Fidelity B
The study tested the live ChatGPT Health product through its real consumer interface, using one standardized prompt across all 960 queries with fresh threads to block memory carryover and a clean structured output that parsed without refusals. Because no competitor model is involved, the harness asymmetry that can undercut fairness does not arise, and the setup matches actual deployment. Reproducibility is limited, since the backbone is named (gpt-5-mini) but decoding settings and the platform system prompt are not exposed, and the forced single-letter, single-turn format with clarifying questions suppressed departs from how the tool is really used.
Scoring Rigor A
Three physicians set every gold-standard triage level against 85 cited guidelines and reached near-perfect agreement, and model outputs were scored by deterministic matching rather than an LLM judge, removing the judge-bias problems that affect many medical benchmarks. Uncertainty is handled thoroughly with cluster bootstrapping, mixed-effects regression, odds ratios with confidence intervals, and Holm correction. The error analysis is the paper's strongest feature, stratifying by acuity and condition and tracing specific failure mechanisms, though each condition was run only once so run-to-run model variability is left unmeasured.
Robustness & Generalizability B
A built-in 2×2×2×2 factorial varies anchoring, access barriers, race, and sex, with a subjective-versus-objective sensitivity analysis and results reported separately by acuity, subgroup, and a dedicated edge-case stratum. The adversarial anchoring manipulation and deliberate loading of emergencies and crisis scenarios make it a genuine stress test. Generalization stays within synthetic vignettes, demographic cells are underpowered, and prompt sensitivity and repeated runs were not done.
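The factorial structure can be sketched as a simple enumeration. Assuming each of the 60 vignettes was run once under every combination of the four two-level factors, the total matches the 960 queries reported; the factor level names below are placeholders, not the study's actual wording.

```python
from itertools import product

# Four two-level factors named in the review; level labels are hypothetical.
FACTORS = {
    "anchoring": (False, True),
    "access_barrier": (False, True),
    "race": ("group_a", "group_b"),
    "sex": ("female", "male"),
}

def build_conditions(n_vignettes):
    """Cross every vignette with every cell of the 2x2x2x2 design."""
    names = list(FACTORS)
    cells = [dict(zip(names, levels)) for levels in product(*FACTORS.values())]
    return [
        {"vignette_id": vignette_id, **cell}
        for vignette_id in range(n_vignettes)
        for cell in cells
    ]

conditions = build_conditions(60)
print(len(conditions))  # 60 vignettes x 16 cells = 960
```

Because every vignette appears in all 16 cells, each nonclinical factor is compared within the same clinical content, but any single cell still holds only one run per vignette.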
Clinical Validity & Safety A
The clinical anchor is a guideline-based reference standard set by three physicians against 85 cited guidelines, though no head-to-head clinician comparator establishes a human error rate on the same vignettes. Safety is the paper's strength, measuring failure to escalate, anchoring-driven reassurance, inconsistent crisis-line activation, and subgroup safety. The evidence is labeled honestly as a synthetic stress test, and the authors state that the system should not be deployed on trust alone.
Bottom Line
Key Finding
51.6% of emergencies undertriaged
ChatGPT Health undertriaged 51.6% (33/64) of gold-standard emergencies to 24–48 h evaluation rather than the emergency department, while overtriaging 64.8% (83/128) of nonurgent cases. The accuracy curve forms an inverted U that fails hardest at the clinical extremes.
Strengths
Independent stress test of a deployed consumer product run through its real interface, with a fresh thread per case so nothing carried over, mirroring actual user conditions. Sixty clinician-authored vignettes across 21 domains, with gold-standard triage set by three physicians at near-perfect agreement (Fleiss' κ = 0.90) and anchored to published guidelines rather than author opinion. A within-vignette factorial design varies anchoring, access barriers, race, and sex to isolate each nonclinical factor. Scoring is against a physician-set gold standard rather than another model's judgment, so there is no grading circularity, and all vignettes, prompts, responses, and code are released openly.
Limitations
Synthetic vignettes rather than real patient interactions, though the authors argue this is a conservative test. Single time point on a single backbone (gpt-5-mini, January 2026), with behavior expected to shift as the product updates. Forced single-letter triage output rather than the hedged advice open-ended use would produce. Within-vignette design is underpowered for small demographic effects, with wide CIs and only 16–19 events per cell. One standardized prompt with no prompt-sensitivity analysis, and emergency undertriage concentrated in trajectory-dependent conditions (asthma, DKA), so generalization to other acute presentations is untested.

Historic Evals

1 benchmark

Benchmarks that do not evaluate the latest 1–2 model generations. Given the pace of capability advancement, results from these benchmarks provide limited signal about current model performance. They retain methodological value — particularly NOHARM's real clinical data and harm framework — but should not drive current model selection decisions.

Benchmark Models Tested Task Design Data & Leakage Model-Use Fidelity Scoring Rigor Robustness Clinical Validity Output Measured Results Framing Overall
NOHARM
2024 · Stanford · Harvard
GPT-5, Claude Sonnet 4.5, Gemini 3 Pro, Grok 4 (31 systems) A A B A A A Management actions from a structured menu Harm counts by severity vs. physicians; NNH B+
Task Design A
The benchmark evaluates clinical management on 100 real primary care–to–specialist eConsults that keep the missing context and uncertainty clinicians actually work with, scoring recommendations against a harm construct defined on a RAND-UCLA and WHO severity scale. Inputs, the recommendation output, and harm-weighted metrics are all specified with explicit formulas. Coverage is broad across 10 specialties and 31 models, though each specialty holds only 10 cases and the physician-superiority claim rests on a 30% subset against 10 internal medicine doctors.
Data, Labels & Leakage A
The benchmark is built on 100 real physician-to-specialist eConsults from Stanford, anonymized and lightly edited, with a gold standard from 29 board-certified physicians who produced 12,747 blinded harm annotations across 4,249 management options on a RAND-UCLA and WHO severity scale at 95.5% concordance. Contamination risk is low because the cases are private institutional data with zero-retention vendor handling, though no explicit leakage checks or provenance dates are reported. Stratification is thin, with only 10 cases per specialty and a 30% subset behind the human comparison.
Model-Use Fidelity B
Every one of the 31 models is named with its version and accessed through documented API and gateway endpoints, with standardized zero-shot prompts released in full and run across two variants and 10 trials, plus a prompt-sensitivity analysis. The comparator set is a strength, pairing a fair no-AI head-to-head against 10 physicians with reference and ablation baselines. Fidelity gaps remain, since no evaluation date is reported, the menu-selection format does not match real use, and retrieval-equipped RAG products share a leaderboard with toolless base models.
Scoring Rigor A
Harm scoring rests on expert human adjudication, with 29 board-certified physicians rating each action on a clinically grounded RAND-UCLA and WHO severity scale, blinded to one another at 95.5% concordance. Model outputs are graded by deterministic option matching rather than an LLM judge. Every model runs in 10 trials across 2 prompts, analyzed with negative binomial regression, confidence intervals, and FDR correction. The error analysis is a highlight, decomposing harm into commission and omission and tracing why over-precise models lose safety.
Robustness & Generalizability A
The study stress-tests its results with 10 repeated trials per model, a Maximizer and Avoider prompt-sensitivity analysis, and a weight-sensitivity check on the Safety index, and rules out the obvious shortcut by showing recommendation volume does not track harm. Generalization spans 10 specialties and 31 models, but all cases are single-site outpatient eConsults skewed to low acuity, so transfer to other settings is untested.
Clinical Validity & Safety A
The benchmark runs a fair head-to-head against 10 board-certified physicians under no-AI conditions, an appropriate generalist comparator for eConsult decisions, layered on a validated specialist reference standard. Safety assessment is the study's core, scoring commission, omission, escalation, and severity across intervention types. The deployment framing is candid about single-turn format, no chart review, and the low-acuity single-site skew, though the headline that top models beat physicians on safety reaches a bit past its 30-case retrospective base.
Bottom Line
Key Finding
Up to 22.2% severe-harm risk
Across 31 LLMs, severe potential harm occurred in up to 22.2% (95% CI 21.6–22.8%) of cases, with errors of omission accounting for 76.6% of severe harm. Safety correlated only moderately with existing AI and medical-knowledge benchmarks (r = 0.61–0.64), and the best models modestly outperformed generalist physicians on Safety.
Strengths
Built on 100 real primary care–to–specialist eConsults from a tertiary academic center rather than tidy textbook vignettes, so the cases keep the missing context and genuine uncertainty clinicians actually face. A panel of 29 board-certified physicians, mostly specialists, produced 12,747 harm annotations across 4,249 management options using a combined RAND-UCLA and WHO severity scale, reaching 95.5% concordance. Coverage is wide, with 31 models plus RAG and multi-agent configurations run in repeated trials. A fair head-to-head against 10 physicians using only conventional tools anchors the comparison, and the case set, leaderboard, and code are released openly.
Limitations
Cases come from outpatient eConsults, which skew toward low-acuity but puzzling problems, so the findings may not transfer to inpatient care or routine primary care visits. Because the benchmark is single-turn with no chart review, the authors added minor clarifying details in some cases, which may have lifted both model and human scores. Actions were scored as largely independent even though real management decisions are linked, and harm ratings reflect academic-center practice patterns. The physician comparison covered only a 30% subset, and the tested cohort now sits roughly two generations behind the current frontier.
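The FDR correction noted under Scoring Rigor matters because comparing 31 models produces many simultaneous tests. The source does not say which procedure the authors used. The sketch below assumes the common Benjamini-Hochberg step-up method, and the function name `benjamini_hochberg` and the example p-values are hypothetical.

```python
from typing import List, Sequence


def benjamini_hochberg(p_values: Sequence[float]) -> List[float]:
    """Return Benjamini-Hochberg adjusted p-values in the original order.

    Assumption: BH is used here only as the most common FDR procedure;
    the benchmark's actual method is not stated in this document.
    """
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, so each adjusted value is
    # never larger than the one ranked above it (monotonicity).
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        candidate = p_values[idx] * m / rank
        running_min = min(running_min, candidate)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical p-values from four pairwise model comparisons.
raw = [0.01, 0.04, 0.03, 0.005]
for p, q in zip(raw, benjamini_hochberg(raw)):
    print(f"raw p = {p:.3f}, adjusted = {q:.3f}")
```

A comparison is then called significant when its adjusted value falls below the chosen false discovery rate, which limits the expected share of false positives among all reported differences.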

Note on Historic classification: Benchmarks in this section use model cohorts that predate the GPT-4.1 / o3 / Claude 3.7 Sonnet generation, roughly the frontier as of early 2025. Given the pace of capability change, results from these evaluations should be read as methodological signal and historical context, not as current model performance data. NOHARM is highlighted as a methodological exemplar despite its historic model set: its real-data, physician-comparison framework represents a gold standard that current-generation benchmarks should aspire to replicate.