How well can AI chatbots mimic doctors in a treatment scenario?

Dr. Scott Gottlieb is a physician who served as the 23rd commissioner of the U.S. Food and Drug Administration. He is a CNBC contributor and serves on the boards of Pfizer and several other health and technology startups. He is also a partner at the venture capital firm New Enterprise Associates. Shani Benezra is a senior research fellow at the American Enterprise Institute and a former associate producer on CBS News' Face the Nation.

Many consumers and healthcare professionals are turning to chatbots built on large language models to answer medical questions and inform treatment decisions. We wanted to find out whether there are major differences among the leading platforms in their clinical aptitude.

To obtain a medical license in the United States, aspiring physicians must pass all three steps of the U.S. Medical Licensing Examination (USMLE), with the third and final step generally considered the most challenging. It requires candidates to answer about 60% of the questions correctly, and historically, the average passing score has been around 75%.

When we subjected the major large language models (LLMs) to the same Step 3 exam, they performed markedly better, achieving scores that exceeded those of many physicians.

But there were some notable differences among the models.

Typically taken after the first year of residency, USMLE Step 3 tests whether medical graduates can apply their understanding of clinical science to the independent practice of medicine. It assesses a new physician's ability to manage patient care across a broad range of medical disciplines, and it includes both multiple-choice questions and computer-based case simulations.

We isolated 50 questions from the 2023 USMLE Step 3 sample test to assess the clinical proficiency of five leading large language models, feeding the same set of questions to each of these platforms — ChatGPT, Claude, Google's Gemini, Grok and Llama.

Other studies have tested the medical performance of these models, but to our knowledge, this is the first time these five leading platforms have been compared head-to-head. These results may provide some insight for consumers and providers about where to turn.
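For readers curious what such a head-to-head comparison looks like in practice, here is a minimal Python sketch of the kind of scoring harness our description implies. The question text, answer key, model names and the query_model() stub are illustrative placeholders rather than our actual setup; a real harness would dispatch to each vendor's API and parse the answer letter out of the reply.

```python
import random

# Placeholder question set: in our test, 50 items from the 2023
# USMLE Step 3 sample exam, each with a keyed correct answer.
QUESTIONS = [
    {"id": 1, "text": "A 75-year-old woman presents with ...", "answer": "C"},
    # ... remaining sample questions
]

# Illustrative model identifiers for the five platforms compared.
MODELS = ["chatgpt-4o", "claude-3.5", "gemini-advanced", "grok", "huggingchat-llama"]

def query_model(model: str, prompt: str) -> str:
    """Stand-in for each vendor's chat API call. A real harness would
    call the OpenAI, Anthropic, Google, xAI or Hugging Face client here
    and extract the single answer letter from the model's response."""
    return random.choice("ABCDE")  # placeholder so the sketch runs end-to-end

def run_benchmark() -> None:
    # Every model receives the identical prompt for every question.
    scores = {m: 0 for m in MODELS}
    for q in QUESTIONS:
        prompt = f"{q['text']}\nAnswer with a single letter."
        for m in MODELS:
            if query_model(m, prompt).strip().upper() == q["answer"]:
                scores[m] += 1
    for m, s in scores.items():
        print(f"{m}: {s}/{len(QUESTIONS)} ({100 * s / len(QUESTIONS):.0f}%)")

if __name__ == "__main__":
    run_benchmark()
```

Giving every model the same prompt, verbatim, is what makes the comparison head-to-head: any difference in score then reflects the model, not the phrasing of the question.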

This is how they performed:

  • ChatGPT-4o (OpenAI) – 49/50 questions correct (98%)
  • Claude 3.5 (Anthropic) – 45/50 (90%)
  • Gemini Advanced (Google) – 43/50 (86%)
  • Grok (xAI) – 42/50 (84%)
  • HuggingChat (Llama) – 33/50 (66%)

In our experiment, OpenAI's ChatGPT-4o emerged as the top performer with a score of 98%. It offered nuanced medical analysis in language reminiscent of a physician's. It not only supplied answers with detailed reasoning, but also contextualized its decision-making and explained why the alternative choices were less appropriate.

Anthropic's Claude came in second with 90%. Its answers read as more human, with plainer language and a bulleted structure that patients would find easier to follow. Gemini, which scored 86%, gave answers that were less thorough than ChatGPT's or Claude's, making its reasoning harder to decipher, but its responses were succinct and straightforward.

Grok, the chatbot from Elon Musk's xAI, scored a respectable 84%, but did not provide descriptive reasoning during our analysis, making it difficult to understand how it arrived at its answers. HuggingChat, an open-source chat platform built on Meta's Llama, performed the worst at 66%, though it still showed sound justification for the questions it answered correctly and provided accurate answers with links to its sources.

One question that most of the models got wrong involved a 75-year-old woman with a hypothetical heart condition, asking what the most appropriate next step in her evaluation would be. Claude was the only model to answer it correctly.

Another notable question involved a 20-year-old man presenting with symptoms of a sexually transmitted infection. It asked which of five options was the most appropriate next step in his care. ChatGPT correctly identified that the patient should be scheduled for an HIV serology test in three months, but the model went further, recommending a follow-up visit in one week to make sure the patient's symptoms had resolved and that the antibiotics covered his strain of infection. For us, the response underscored the model's ability to reason more comprehensively, going beyond the multiple-choice options the exam presented.

These models were not designed for medical reasoning; they are consumer technology products built for tasks such as language translation and content generation. Despite their non-medical origins, they showed a surprising aptitude for clinical reasoning.

Newer platforms are being developed specifically to solve medical problems. Google recently introduced Med-Gemini, an improved version of its earlier Gemini models, tailored for medical applications and equipped with web-based search capabilities to improve clinical reasoning.

As these models continue to evolve, they will become increasingly adept at analyzing complex medical data, diagnosing diseases and recommending treatments. They could offer a degree of precision and consistency that human providers, subject to fatigue and error, sometimes struggle to match. And they pave the way to a future where the portals of care are no longer controlled by doctors, but by machines.
