BMC Oral Health, vol. 26, no. 1, 2026 (SCI-Expanded, Scopus)
Background: Dental trauma represents a common emergency condition requiring rapid and accurate guidance to prevent permanent damage. In urgent scenarios where access to dental professionals may be delayed, large language models (LLMs) have the potential to provide patients with timely and relevant information. The aim of this study was to comparatively evaluate the dentist-rated performance of five widely used LLMs in answering frequently asked questions related to dental trauma emergencies across three predefined components: accuracy, comprehensiveness, and clinical applicability.

Methods: Based on the guidelines of the International Association of Dental Traumatology (IADT), the ToothSOS application, and frequently asked patient questions, 27 open-ended questions in Turkish were prepared and divided into five clinical subcategories: avulsion, post-replantation care, luxation, fractures, and other traumas. The questions were posed to the models ChatGPT-4o, Claude 3.5, DeepSeek, Microsoft Copilot, and Gemini 2.0 Flash. Twenty experienced dentists evaluated the responses using a 5-point Likert scale, jointly considering the three predefined components: accuracy, comprehensiveness, and clinical applicability. A total of 2,700 individual ratings were analyzed using the Friedman test, Bonferroni-corrected Wilcoxon tests, and the Intraclass Correlation Coefficient (ICC).

Results: A significant difference in overall performance was observed among the models (p < 0.001). ChatGPT-4o achieved the highest mean score (4.63 ± 0.57), whereas Gemini received the lowest (3.87 ± 0.91). Claude, DeepSeek, and Copilot demonstrated similar, moderate performance, with means ranging approximately between 4.1 and 4.3. Median values were 5 (IQR 4-5) for ChatGPT-4o and Claude and 4 (IQR 3-5) for the other models, indicating that the responses of ChatGPT-4o and Claude were rated more consistently high.
Within each model, no statistically significant differences were observed in mean Likert scores across clinical subcategories (p > 0.05).

Conclusions: Although LLMs cannot replace professional clinical examination, they may serve as a rapid supportive source of patient-oriented information in dental trauma emergencies where access to a dentist is limited. However, model-dependent differences highlight the need for regular verification before patient-facing use.
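The omnibus-plus-post-hoc procedure described in the Methods (a Friedman test across the five related rating samples, followed by pairwise Wilcoxon signed-rank tests with Bonferroni correction) can be sketched as follows. This is an illustrative reconstruction, not the authors' analysis code; the synthetic Likert ratings below merely mimic the reported mean ordering of the five models.

```python
# Sketch of the abstract's statistical pipeline: Friedman omnibus test,
# then Bonferroni-corrected pairwise Wilcoxon tests. The data are
# synthetic (assumed) 1-5 Likert ratings, NOT the study's real ratings.
import itertools
import numpy as np
from scipy.stats import friedmanchisquare, wilcoxon

rng = np.random.default_rng(0)
models = ["ChatGPT-4o", "Claude 3.5", "DeepSeek", "Copilot", "Gemini 2.0"]
# Rows = rater/question pairs (20 raters x 27 questions = 540),
# columns = models; loc values roughly follow the reported means.
ratings = np.clip(
    rng.normal(loc=[4.6, 4.3, 4.2, 4.1, 3.9], scale=0.6, size=(540, 5)).round(),
    1, 5,
)

# Omnibus test: do the five related samples of ratings differ?
stat, p = friedmanchisquare(*ratings.T)
print(f"Friedman chi2 = {stat:.2f}, p = {p:.3g}")

# Post hoc: pairwise Wilcoxon signed-rank tests, Bonferroni-corrected
# alpha for the 10 possible model pairs.
pairs = list(itertools.combinations(range(5), 2))
alpha = 0.05 / len(pairs)
for i, j in pairs:
    _, pw = wilcoxon(ratings[:, i], ratings[:, j])
    mark = "*" if pw < alpha else ""
    print(f"{models[i]} vs {models[j]}: p = {pw:.3g} {mark}")
```

The Bonferroni step divides the significance threshold by the number of pairwise comparisons rather than adjusting the p-values themselves; either convention yields the same accept/reject decisions.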