Evaluation of the performance of artificial intelligence platforms in answering and generating new questions in prosthetic dentistry specialization

Kuşçu, Aliye; Çınarer, Gökalp; Kuscu, Süha

doi:10.38053/acmj.1848512

Kuşçu A. İ., Çınarer G., Kuscu S.

Anatolian Current Medical Journal, vol.8, no.3, pp.409-416, 2026 (TRDizin)

Publication Type: Article / Full Article
Volume: 8, Issue: 3
Publication Date: 2026
DOI: 10.38053/acmj.1848512
Journal Name: Anatolian Current Medical Journal
Journal Indexes: TR DİZİN (ULAKBİM)
Page Numbers: pp.409-416
Affiliated with Yozgat Bozok University: Yes

Abstract

Aims: This study aimed to evaluate large language models (LLMs) not only in answering Dentistry Specialty Examination (DUS) questions but also in generating new DUS-format questions, with expert validation of educational and clinical quality.

Methods: A total of 130 official DUS questions published between 2012 and 2021 were used to assess the answering performance of four LLMs (ChatGPT, Gemini, DeepSeek, and Grok). Additionally, each model generated 20 new multiple-choice questions (n=80), which were independently evaluated by expert prosthodontists for content accuracy, clinical relevance, discriminative capacity, and conformity with DUS standards. Expert-approved questions were subsequently re-answered by all models to enable cross-model performance analysis. Model performances were compared using descriptive statistics, one-sample proportion tests against chance level (p₀=0.20), and inter-model comparisons using Cochran’s Q and McNemar tests.

Results: ChatGPT achieved the highest overall accuracy on historical DUS questions (81.3%), followed by Gemini (72.8%), DeepSeek (70.3%), and Grok (68.8%). On expert-validated AI-generated questions, overall accuracy ranged from 71.3% to 78.8% across models, with no statistically significant inter-model difference (Q=3.82, p=0.28). All models performed significantly above chance level (p<0.001). Importantly, question-generation quality and answering performance were not consistently aligned across models.

Conclusion: Although LLMs perform significantly above chance on DUS-style questions, both the answering accuracy and the educational validity of AI-generated questions require expert supervision. LLMs should be considered supportive tools rather than autonomous agents in high-stakes dental education and assessment contexts.
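The three tests named in the Methods can be illustrated with a minimal Python sketch. This is not the authors' analysis code: the function names, the use of exact (rather than asymptotic) p-values, and every count in the demo are assumptions made purely for illustration, since the abstract reports only percentages and summary statistics.

```python
from math import comb


def binomial_p_above_chance(correct: int, total: int, p0: float = 0.20) -> float:
    """One-sided exact binomial p-value P(X >= correct) under H0: accuracy == p0.

    p0 = 0.20 is the chance level for five-option multiple-choice questions.
    """
    return sum(
        comb(total, k) * p0**k * (1 - p0) ** (total - k)
        for k in range(correct, total + 1)
    )


def cochrans_q(results: list[list[int]]) -> float:
    """Cochran's Q statistic for a questions-by-models table of 0/1 outcomes.

    Q = (k - 1) * (k * sum(C_j^2) - N^2) / (k * N - sum(R_i^2)),
    where C_j are column (model) totals, R_i are row (question) totals,
    N is the grand total, and k is the number of models.
    """
    k = len(results[0])
    col_totals = [sum(row[j] for row in results) for j in range(k)]
    row_totals = [sum(row) for row in results]
    grand_total = sum(row_totals)
    denominator = k * grand_total - sum(r * r for r in row_totals)
    if denominator == 0:
        # Every question was answered identically by all models, so Q is
        # undefined; returning 0.0 is a convention chosen for this sketch.
        return 0.0
    numerator = (k - 1) * (k * sum(c * c for c in col_totals) - grand_total**2)
    return numerator / denominator


def mcnemar_exact_p(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from the two discordant-pair counts.

    b: questions model A answered correctly and model B did not.
    c: questions model B answered correctly and model A did not.
    """
    n = b + c
    if n == 0:
        return 1.0
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) * 0.5**n
    return min(1.0, 2 * tail)


if __name__ == "__main__":
    # Hypothetical counts, NOT taken from the study.
    print(binomial_p_above_chance(correct=100, total=130))
    toy_table = [[1, 1, 0], [1, 0, 0], [1, 1, 0], [1, 0, 0]]  # 4 questions, 3 models
    print(cochrans_q(toy_table))
    print(mcnemar_exact_p(b=2, c=9))
```

Under the null hypothesis, Q is compared against a chi-square distribution with k - 1 degrees of freedom (3 for four models). McNemar tests are typically run as pairwise follow-ups between models, using only the questions on which the two models disagree.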