Can We Trust This Model? Human-Centered Evaluation Metrics For Clinical AI Systems


Kuş A.

Presentation, pp. 1-3, 2026

  • Publication Type: Other Publications / Presentation
  • Publication Date: 2026
  • Pages: pp. 1-3
  • Affiliated with Yozgat Bozok Üniversitesi: Yes

Abstract

Artificial intelligence (AI) systems have achieved impressive performance in a wide range of clinical prediction tasks, from diagnosis to risk stratification. Nevertheless, the deployment of AI in real clinical settings remains limited, largely due to concerns about trust and reliability. For clinicians, trust is not defined solely by predictive accuracy, but by whether a model behaves in a manner that is consistent, appropriately uncertain, and safe across diverse patient populations. Despite this, the evaluation of clinical AI systems continues to rely predominantly on performance-centric metrics such as accuracy, AUC, or F1-score. While these metrics capture discrimination ability, they fail to reflect key factors that shape human trust, including confidence calibration, robustness to data imperfections, and subgroup-level safety. As a result, highly accurate models may still produce outputs that clinicians hesitate to rely on in high-stakes decision-making.

Recent research on trustworthy and responsible AI has emphasized dimensions such as fairness, robustness, calibration, and explainability. However, these dimensions are typically evaluated in isolation and reported from a technical perspective. What remains largely absent is an evaluation paradigm that reflects how clinicians integrate multiple signals when deciding whether to trust a model. In particular, miscalibrated confidence estimates can lead to overconfident predictions, distribution shifts can cause silent performance degradation, and aggregate metrics can mask clinically significant subgroup risks. These limitations highlight a gap between existing evaluation practices and the human-centered notion of trust required for safe clinical adoption.

We propose a human-centered evaluation framework that reframes model assessment around trust-relevant questions rather than single performance indicators. The framework integrates four complementary dimensions aligned with clinical reasoning (a minimal computational sketch follows the list):

  • Predictive performance, measuring baseline discrimination ability.

  • Confidence calibration, assessing whether predicted probabilities accurately reflect uncertainty.

  • Robustness to distribution shifts, evaluating model stability under noise, missing data, or realistic perturbations.

  • Subgroup safety analysis, identifying performance and calibration disparities across clinically relevant patient groups.
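
The sketch below illustrates, under stated assumptions, how these four dimensions could be computed for a fitted binary classifier. The function name trust_metrics, the Gaussian-noise perturbation, and the specific choices of expected calibration error and worst-subgroup AUC are illustrative instantiations only; the abstract does not prescribe these particular metrics or implementations.

    import numpy as np
    from sklearn.metrics import roc_auc_score

    def expected_calibration_error(y_true, y_prob, n_bins=10):
        """Average |observed event rate - mean predicted probability| over confidence bins."""
        bins = np.linspace(0.0, 1.0, n_bins + 1)
        ece = 0.0
        for lo, hi in zip(bins[:-1], bins[1:]):
            in_bin = (y_prob > lo) & (y_prob <= hi)
            if in_bin.sum() == 0:
                continue
            conf = y_prob[in_bin].mean()   # mean predicted probability in the bin
            acc = y_true[in_bin].mean()    # observed event rate in the bin
            ece += (in_bin.sum() / len(y_prob)) * abs(acc - conf)
        return ece

    def trust_metrics(model, X, y, groups, noise_scale=0.1, seed=0):
        """Four trust-relevant quantities for a fitted binary classifier (X, y, groups as NumPy arrays)."""
        rng = np.random.default_rng(seed)
        p = model.predict_proba(X)[:, 1]

        # 1) Predictive performance: baseline discrimination ability.
        auc = roc_auc_score(y, p)

        # 2) Confidence calibration: do predicted probabilities match observed frequencies?
        ece = expected_calibration_error(y, p)

        # 3) Robustness: discrimination after a simple Gaussian perturbation of the inputs
        #    (a stand-in for noise, missingness, or realistic distribution shifts).
        X_noisy = X + rng.normal(scale=noise_scale * X.std(axis=0), size=X.shape)
        auc_drop = auc - roc_auc_score(y, model.predict_proba(X_noisy)[:, 1])

        # 4) Subgroup safety: discrimination within each clinically relevant group
        #    (each group is assumed to contain both outcome classes).
        group_aucs = {g: roc_auc_score(y[groups == g], p[groups == g]) for g in np.unique(groups)}

        return {
            "auc": auc,
            "ece": ece,
            "auc_drop_under_noise": auc_drop,
            "worst_group_auc": min(group_aucs.values()),
            "group_aucs": group_aucs,
        }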

Rather than collapsing these aspects into a single opaque score, we present them as a transparent trustworthiness scorecard that supports human interpretation and risk-aware decision-making. We are currently applying this evaluation framework to clinical risk prediction tasks using tabular healthcare datasets. Baseline models, including logistic regression, gradient-boosted trees, and neural networks, are evaluated under identical conditions. Preliminary results suggest that models with similar AUC values can differ substantially in calibration quality, robustness to perturbations, and subgroup behavior. These differences, while often invisible to standard metrics, are critical from a trust and safety perspective.
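
Under the same caveat, the snippet below sketches how such a scorecard might be assembled for the named baseline families, reusing trust_metrics from the sketch above. The synthetic dataset, the binary group variable, and the specific scikit-learn estimators and hyperparameters are placeholders; the abstract does not specify the datasets, splits, or model configurations.

    import numpy as np
    import pandas as pd
    from sklearn.datasets import make_classification
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.neural_network import MLPClassifier
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    # Synthetic stand-in for a tabular clinical dataset; 'groups' plays the role of a
    # clinically relevant subgroup variable such as sex or an age band.
    X, y = make_classification(n_samples=4000, n_features=20, random_state=0)
    groups = np.random.default_rng(0).integers(0, 2, size=len(y))
    X_tr, X_te, y_tr, y_te, _, g_te = train_test_split(X, y, groups, random_state=0)

    baselines = {
        "logistic_regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
        "gradient_boosted_trees": GradientBoostingClassifier(random_state=0),
        "neural_network": make_pipeline(StandardScaler(),
                                        MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500, random_state=0)),
    }

    rows = []
    for name, model in baselines.items():
        model.fit(X_tr, y_tr)  # identical data and split for every baseline
        metrics = trust_metrics(model, X_te, y_te, g_te)
        rows.append({"model": name, **{k: v for k, v in metrics.items() if k != "group_aucs"}})

    # One row per model, one column per trust dimension: the scorecard stays transparent
    # instead of being collapsed into a single opaque number.
    scorecard = pd.DataFrame(rows).set_index("model")
    print(scorecard.round(3))

In this layout, two baselines with nearly identical AUC can still be distinguished by their calibration error, their AUC drop under perturbation, or their worst-group AUC, which is the kind of difference the abstract describes as invisible to standard metrics.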

Finally, by shifting the evaluation focus from “How accurate is the model?” to “How trustworthy is the model for human decision-makers?”, this work aims to bridge the gap between technical performance and clinical adoption. Future work will extend robustness analyses, refine subgroup safety assessments, and investigate how human-centered evaluation signals influence clinician confidence and reliance on AI systems.