
Board-Certified Physicians Put Popular LLMs Through Their Paces, and Found Real Problems
When people feel a strange pain or notice a worrying symptom, more and more of them are skipping the doctor’s office and heading straight to an AI chatbot. It’s fast, free, and available at 3 a.m. But a study suggests that convenience might come with a serious catch: even the best-performing AI fell short on more than 15% of medical questions, roughly one in every six or seven.
In a preprint study (not yet peer-reviewed) posted online by researchers from Penn State, four popular AI chatbots were put to the test using real and imagined health concerns submitted by university students, staff, and faculty. A panel of nine board-certified physicians then graded the AI responses. Overall results were mixed: impressive enough to turn heads, but flawed enough to raise real concerns about what happens when someone acts on bad medical advice.
Nearly one in four adults under 30 already use AI monthly for health-related guidance, according to data cited in the paper. Understanding what these tools get right (and wrong) is essential.
How Researchers Tested AI Chatbots on Health Questions
Researchers organized a university-wide competition in fall 2024. A total of 34 participants were invited to query one of four AI chatbots — ChatGPT-4o, ChatGPT-3.5, Gemini-1.5 Pro, and Llama3-8b — with health-related questions they might genuinely want answered. Participants could approach the task from one of three angles: as a patient describing personal symptoms, as a medical professional seeking diagnostic help, or through an out-of-the-box track that allowed for alternative medical query scenarios, such as analyzing images of handwritten prescriptions.
Competition entries generated 212 AI responses in total. Those responses were then divided among a panel of nine board-certified physicians, each of whom graded them on four measures: how valid the information was, the quality of the information, how well the AI reasoned through the problem, and whether the response could cause harm.
Gemini-1.5 Pro produced the largest share of responses, 140 out of 212, while Llama3-8b generated only 6. That imbalance matters when comparing models directly, and the researchers acknowledged it as a limitation.
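To make the imbalance concrete, here is a minimal Python sketch that turns the reported counts into shares of the 212 responses. The article gives counts only for Gemini-1.5 Pro and Llama3-8b, so the two ChatGPT models are lumped into a single remainder; that combined figure is simple subtraction, not a number reported in the study.

```python
# Response counts as reported in the article.
TOTAL_RESPONSES = 212
counts = {
    "Gemini-1.5 Pro": 140,
    "Llama3-8b": 6,
}

# The article does not break out the two ChatGPT models, so the remainder
# is treated as one combined group (derived by subtraction, an assumption).
counts["ChatGPT-4o + ChatGPT-3.5 (combined)"] = TOTAL_RESPONSES - sum(counts.values())


def share(count, total=TOTAL_RESPONSES):
    """Return a count as a percentage of all responses."""
    return 100.0 * count / total


for model, count in counts.items():
    print(f"{model}: {count} responses ({share(count):.1f}% of total)")
```

With so few Llama3-8b responses, a single answer graded differently would move that model's score by several percentage points, which is why direct comparisons need caution.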
What Doctors Found When They Graded the AI Responses
Across all four AI models, about 76% of responses were rated as valid by physicians. That sounds reasonable until the math flips: nearly one in four responses didn’t make the cut. For ChatGPT-4o, the highest-performing model, validity hit 84.6%, still leaving more than 15% of answers falling short. Llama3-8b landed at the bottom, with only half its responses rated as valid, though that figure rests on just six responses.
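The "math flip" can be worked through in a few lines of Python. This is only an illustrative sketch using the validity percentages quoted above; the function names are hypothetical, and the overall figure is entered as 76.0 even though the article gives it as "about 76%".

```python
# Validity rates quoted in the article (percent of responses rated valid).
reported_validity = {
    "All four models (approximate)": 76.0,
    "ChatGPT-4o": 84.6,
    "Llama3-8b": 50.0,
}


def not_valid_pct(validity_pct):
    """Percent of responses that were NOT rated valid."""
    return 100.0 - validity_pct


def one_in_n(validity_pct):
    """Express the not-valid rate as 'one in N' responses."""
    return 100.0 / not_valid_pct(validity_pct)


for name, pct in reported_validity.items():
    print(f"{name}: {not_valid_pct(pct):.1f}% not valid, "
          f"about 1 in {one_in_n(pct):.1f} responses")
```

Read this way, the overall rate works out to roughly one response in four, and the best model's to roughly one in six or seven.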
Which type of medical question was asked also mattered. Questions about obstetrics and gynecology scored the highest for accuracy, while neurology, internal medicine, and dermatology consistently ranked lower. Neurology cases in the study often involved rare conditions that are hard to diagnose under any circumstances, while dermatology relies heavily on visual examination — something a text-based chatbot simply cannot replicate.
Prompt length turned out to be a factor, too. Very short questions and very long, detailed ones both produced weaker results. Best performance came from medium-length queries, somewhere between 60 and 250 characters. Medical professionals said in follow-up interviews that the more specific and focused the question, the better the AI tended to perform.
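For readers who want to check where their own question falls, the sketch below sorts a prompt into the three length bands described above. The 60 and 250 character cutoffs come from the article; treating both ends as inclusive, trimming surrounding whitespace, and the function name itself are assumptions made for illustration.

```python
# Range reported in the article; treating both ends as inclusive is an assumption.
MEDIUM_MIN_CHARS = 60
MEDIUM_MAX_CHARS = 250


def classify_prompt_length(prompt: str) -> str:
    """Label a prompt as 'short', 'medium', or 'long' by character count."""
    length = len(prompt.strip())  # ignore leading and trailing whitespace
    if length < MEDIUM_MIN_CHARS:
        return "short"
    if length > MEDIUM_MAX_CHARS:
        return "long"
    return "medium"


# Hypothetical example question, not taken from the study.
example = ("I have had a dull headache behind my right eye for three days. "
           "What could cause it?")
print(len(example), classify_prompt_length(example))
```

Length is only a rough proxy here. The physicians' comments suggest that focus and specificity matter more than hitting a particular character count.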
Adding a Medical Encyclopedia Didn’t Always Help AI Chatbots
One of the study’s more surprising results involved a technique called Retrieval-Augmented Generation, or RAG. The technique gives the AI access to a curated library, in this case medical textbooks, clinical guidelines, and research articles from a university medical school, and retrieves relevant passages from it before the model generates a response. Grounding the AI in vetted medical sources should, in theory, make its answers more reliable.
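The basic shape of RAG can be shown with a toy Python example: find the library passage that best matches the question, then put it in front of the question before the text goes to a model. This is a deliberately simplified sketch, not the study's pipeline. The three passages are illustrative placeholders, the word-overlap scoring stands in for whatever retrieval method the researchers used, and no actual chatbot is called.

```python
import re

# Hypothetical stand-in for a curated medical library; these passages are
# illustrative placeholders, not the study's actual sources.
LIBRARY = [
    "Clinical guideline: persistent headache with vision changes warrants prompt evaluation.",
    "Textbook note: contact dermatitis often presents as an itchy red rash after exposure to an irritant.",
    "Research summary: prenatal care schedules typically include regular blood pressure checks.",
]


def tokenize(text):
    """Lowercase a string and return its set of alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))


def retrieve(question, library, top_k=1):
    """Rank passages by how many words they share with the question."""
    question_words = tokenize(question)
    ranked = sorted(
        library,
        key=lambda passage: len(question_words & tokenize(passage)),
        reverse=True,
    )
    return ranked[:top_k]


def build_prompt(question, library, top_k=1):
    """Prepend the retrieved passages to the question (assumed template)."""
    context = "\n".join(f"- {p}" for p in retrieve(question, library, top_k))
    return (
        "Use the reference material below to answer.\n\n"
        f"References:\n{context}\n\n"
        f"Question: {question}"
    )


print(build_prompt("I have an itchy red rash on my arm", LIBRARY))
```

Real systems typically replace the word-overlap step with a more sophisticated search, but the principle is the same: the quality of the final answer depends partly on whether the retrieved passages are actually relevant to the question.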
Seven medical professionals were recruited to compare standard AI responses against RAG-enhanced ones, side by side. For Gemini-1.5 Pro and Llama3-8b, the medical professionals actually preferred the standard, unenhanced versions by a wide and statistically significant margin. For the ChatGPT models, there was no significant difference either way.
Researchers stopped short of declaring RAG unhelpful overall, noting that the results varied by model and that future research should explore the approach further.
Source: https://studyfinds.com/best-ai-chatbot-gets-health-questions-wrong-doctors-find/