Is Readability a Proxy for Reliability? A Qualitative Evaluation of ChatGPT-4.0 in Orthopaedic Trauma Communication


Atahan M. O., Üner Ç., Aydemir M., Uzun M. F., Yalın M., Gölgelioğlu F.

Journal of Evaluation in Clinical Practice, vol. 31, no. 5, 2025 (SCI-Expanded, Scopus)

  • Publication Type: Article / Full Article
  • Volume: 31 Issue: 5
  • Publication Date: 2025
  • DOI: 10.1111/jep.70238
  • Journal Name: Journal of Evaluation in Clinical Practice
  • Indexed In: Science Citation Index Expanded (SCI-EXPANDED), Scopus, Academic Search Premier, CAB Abstracts, CINAHL, MEDLINE, PsycINFO
  • Keywords: artificial intelligence, health education, natural language processing, orthopaedic procedures, readability
  • Affiliated with Yozgat Bozok Üniversitesi: Yes

Abstract

Aim: This study aimed to evaluate the accuracy, readability, and safety of ChatGPT-4.0's responses to frequently asked questions (FAQs) about orthopaedic trauma, and to examine whether readability is associated with the quality and reliability of the content.

Methods: Ten common patient questions on orthopaedic emergencies were submitted to ChatGPT-4.0. Each response was assessed independently by three orthopaedic trauma surgeons using a 4-point ordinal scale for accuracy, clinical appropriateness, and safety. Readability was calculated with the Flesch-Kincaid Grade Level (FKGL). Inter-rater agreement was analysed using intraclass correlation coefficients (ICC). The presence of disclaimers was also recorded.

Results: ChatGPT-4.0's responses had a mean FKGL of 10.5, indicating high-school-level readability. Stratified analysis showed comparable readability scores across response quality categories: excellent (10.0), poor (9.8), and dangerous (10.1), suggesting that readability does not predict content reliability. Accuracy and safety scores varied considerably among responses, with the highest inter-rater agreement for clinical appropriateness (ICC = 0.81) and the lowest for safety assessments (ICC = 0.68). Notably, nine of the 10 responses included a disclaimer indicating the nonprofessional nature of the content; the single omission occurred in a high-risk clinical scenario.

Conclusion: Although ChatGPT-4.0 provides generally readable responses to orthopaedic trauma questions, readability does not reliably distinguish accurate from potentially harmful information. These findings underscore the need for expert review when AI-generated content is used in clinical communication.
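For readers unfamiliar with the readability metric cited above, the Flesch-Kincaid Grade Level is a standard formula based on average sentence length and average syllables per word: FKGL = 0.39 × (words / sentences) + 11.8 × (syllables / words) − 15.59. The Python sketch below is not taken from the paper; it illustrates the calculation under the assumption of a crude vowel-group syllable counter, whereas dedicated readability tools typically use dictionary-based syllable counts.

import re

def count_syllables(word):
    # Rough heuristic (assumption): one syllable per group of consecutive vowels.
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))

def fkgl(text):
    # Flesch-Kincaid Grade Level:
    # 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * (len(words) / len(sentences)) + 11.8 * (syllables / len(words)) - 15.59

# Hypothetical patient-facing sentence, for illustration only.
print(round(fkgl("The fracture must be immobilised. Seek urgent care if numbness develops."), 1))

A score of roughly 10, as reported for the study's responses, corresponds to text readable by a 10th-grade (high-school) student.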