Dr. Scott Gottlieb is a physician and served as the 23rd Commissioner of the U.S. Food and Drug Administration. He is a CNBC contributor and serves on the boards of Pfizer and several startups in health and tech. He is also a partner at the venture capital firm New Enterprise Associates. Shani Benezra is a senior research associate at the American Enterprise Institute and a former associate producer at CBS News’ Face the Nation.
Many consumers and medical providers are turning to chatbots, powered by large language models, to answer medical questions and inform treatment choices. We decided to see whether there were major differences between the leading platforms when it came to their clinical aptitude.
To secure a medical license in the United States, aspiring doctors must successfully navigate three stages of the U.S. Medical Licensing Examination (USMLE), with the third and final installment widely regarded as the most challenging. It requires candidates to answer about 60% of the questions correctly, and historically, the average passing score hovered around 75%.
When we subjected the major large language models (LLMs) to the same Step 3 examination, their performance was markedly superior, achieving scores that significantly outpaced many doctors.
But there were some clear differences between the models.
Typically taken after the first year of residency, the USMLE Step 3 gauges whether medical graduates can apply their understanding of clinical science to the unsupervised practice of medicine. It assesses a new doctor’s ability to manage patient care across a broad range of medical disciplines and includes both multiple-choice questions and computer-based case simulations.
We isolated 50 questions from the 2023 USMLE Step 3 sample test to evaluate the clinical proficiency of five different leading large language models, feeding the same set of questions to each of these platforms — ChatGPT, Claude, Google Gemini, Grok and Llama.
Other studies have gauged these models for their medical proficiency, but to our knowledge, this is the first time these five leading platforms have been compared in a head-to-head evaluation. These results could give consumers and providers some insights on where they should be turning.
Here’s how they scored:
- ChatGPT-4o (OpenAI) — 49/50 questions correct (98%)
- Claude 3.5 (Anthropic) — 45/50 (90%)
- Gemini Advanced (Google) — 43/50 (86%)
- Grok (xAI) — 42/50 (84%)
- HuggingChat (Llama) — 33/50 (66%)
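The percentages above follow directly from the raw counts. As a minimal sketch, the Python snippet below recomputes each score from the number of questions answered correctly out of 50 and checks it against the roughly 60% passing threshold mentioned earlier. The names `score_report` and `PASS_THRESHOLD`, and the exact 0.60 cutoff, are our assumptions for illustration and not the exam's official scoring method.

```python
# Raw results reported in the article: questions correct out of 50.
TOTAL_QUESTIONS = 50
# Assumption: 0.60 stands in for the "about 60%" passing threshold.
PASS_THRESHOLD = 0.60

correct_counts = {
    "ChatGPT-4o": 49,
    "Claude 3.5": 45,
    "Gemini Advanced": 43,
    "Grok": 42,
    "HuggingChat (Llama)": 33,
}


def score_report(counts, total=TOTAL_QUESTIONS, threshold=PASS_THRESHOLD):
    """Return {model: (percent_correct, cleared_threshold)}."""
    report = {}
    for model, correct in counts.items():
        fraction = correct / total
        # Round to a whole percent, matching how the scores are listed.
        report[model] = (round(fraction * 100), fraction >= threshold)
    return report


report = score_report(correct_counts)
for model, (percent, cleared) in report.items():
    status = "above" if cleared else "below"
    print(f"{model}: {percent}% ({status} the assumed threshold)")
```

Under this assumed cutoff, every model clears the threshold, including the lowest scorer at 33 of 50.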
In our experiment, OpenAI’s ChatGPT-4o emerged as the top performer, achieving a score of 98%. It provided detailed medical analyses, employing language reminiscent of a medical professional. It not only delivered answers with extensive reasoning, but also contextualized its decision-making process, explaining why alternative answers were less suitable.
Claude, from Anthropic, came in second with a score of 90%. It provided more human-like responses with simpler language and a bullet-point structure that might be more approachable to patients. Gemini, which scored 86%, gave answers that weren’t as thorough as those from ChatGPT or Claude, making its reasoning harder to decipher, but its answers were succinct and straightforward.
Grok, the chatbot from Elon Musk’s xAI, scored a respectable 84% but didn’t provide descriptive reasoning during our analysis, making it hard to understand how it arrived at its answers. While HuggingChat — an open-source website built from Meta’s Llama — scored the lowest at 66%, it nonetheless showed good reasoning for the questions it answered correctly, providing concise responses and links to sources.
One question that most of the models got wrong related to a 75-year-old woman with a hypothetical heart condition. The question asked the physicians which was the most appropriate next step as part of her evaluation. Claude was the only model that generated the correct answer.
Another notable question focused on a 20-year-old male patient presenting with symptoms of a sexually transmitted infection. It asked physicians which of five choices was the appropriate next step as part of his workup. ChatGPT correctly determined that the patient should be scheduled for HIV serology testing in three months, but the model went further, recommending a follow-up examination in one week to ensure that the patient’s symptoms had resolved and that the antibiotics covered his strain of infection. To us, the response highlighted the model’s capacity for broader reasoning, expanding beyond the fixed choices presented by the exam.
These models weren’t designed for medical reasoning; they’re products of the consumer technology sector, crafted to perform tasks like language translation and content generation. Despite their non-medical origins, they’ve shown a surprising aptitude for clinical reasoning.
Newer platforms are being purposely built to solve medical problems. Google recently introduced Med-Gemini, a refined version of its previous Gemini models that’s fine-tuned for medical applications and equipped with web-based searching capabilities to enhance clinical reasoning.
As these models evolve, their skill in analyzing complex medical data, diagnosing conditions and recommending treatments will sharpen. They may offer a level of precision and consistency that human providers, constrained by fatigue and error, might sometimes struggle to match, and open the way to a future where treatment portals can be powered by machines, rather than doctors.