Artificial
intelligence (AI) systems have achieved impressive performance in a wide range
of clinical prediction tasks, from diagnosis to risk stratification.
Nevertheless, the deployment of AI in real clinical settings remains limited,
largely due to concerns about trust and reliability. For clinicians, trust
is not defined solely by predictive accuracy, but by whether a model behaves in
a manner that is consistent, appropriately uncertain, and safe across diverse
patient populations. Despite this, the evaluation of clinical AI systems
continues to rely predominantly on performance-centric metrics such as
accuracy, AUC, or F1-score. While these metrics capture discrimination ability,
they fail to reflect key factors that shape human trust, including confidence
calibration, robustness to data imperfections, and subgroup-level safety.
As a result, highly accurate models may still produce outputs that clinicians hesitate
to rely on in high-stakes decision-making.
Recent
research on trustworthy and responsible AI has emphasized dimensions such as
fairness, robustness, calibration, and explainability. However,
these dimensions are typically evaluated in isolation and reported from a
technical perspective. What remains largely absent is an evaluation paradigm
that reflects how clinicians integrate multiple signals when deciding whether
to trust a model. In particular, miscalibrated confidence estimates can lead to
overconfident predictions, distribution shifts can cause silent performance
degradation, and aggregate metrics can mask clinically significant subgroup
risks. These limitations highlight a gap between existing evaluation
practices and the human-centered notion of trust required for safe clinical
adoption.
We
propose a human-centered evaluation framework that reframes model assessment
around trust-relevant questions rather than single performance indicators. The
framework integrates four complementary dimensions aligned with clinical
reasoning:
· Predictive performance, measuring baseline discrimination ability.
· Confidence calibration, assessing whether predicted probabilities accurately reflect uncertainty.
· Robustness to distribution shifts, evaluating model stability under noise, missing data, or realistic perturbations.
· Subgroup safety analysis, identifying performance and calibration disparities across clinically relevant patient groups.
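As a minimal sketch of how these four dimensions might be quantified for a single fitted classifier, the example below uses synthetic tabular data and assumed metric choices (AUC, a binned expected calibration error, AUC under Gaussian input noise, and a worst-case subgroup AUC gap); the helper names and the subgroup indicator are illustrative assumptions, not the framework's prescribed implementation.

```python
# Illustrative sketch (assumed metric choices, synthetic data): quantifying the
# four trust dimensions for one fitted probabilistic classifier.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split


def expected_calibration_error(y_true, y_prob, n_bins=10):
    """Average |observed event rate - mean predicted probability| over confidence bins."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bin_idx = np.clip(np.digitize(y_prob, edges) - 1, 0, n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_idx == b
        if mask.any():
            ece += mask.mean() * abs(y_true[mask].mean() - y_prob[mask].mean())
    return ece


# Synthetic stand-in for a tabular clinical risk dataset, with a hypothetical
# binary subgroup indicator (e.g. an age band) for the safety analysis.
X, y = make_classification(n_samples=4000, n_features=20, random_state=0)
groups = (X[:, 0] > 0).astype(int)
X_tr, X_te, y_tr, y_te, _, g_te = train_test_split(X, y, groups, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
probs = model.predict_proba(X_te)[:, 1]

# 1) Predictive performance: baseline discrimination.
auc = roc_auc_score(y_te, probs)

# 2) Confidence calibration: do predicted probabilities match observed event rates?
ece = expected_calibration_error(y_te, probs)

# 3) Robustness: discrimination under a simple Gaussian-noise perturbation of the
#    inputs (a crude proxy for measurement noise or distribution shift).
rng = np.random.default_rng(0)
probs_noisy = model.predict_proba(X_te + rng.normal(0.0, 0.5, X_te.shape))[:, 1]
auc_under_noise = roc_auc_score(y_te, probs_noisy)

# 4) Subgroup safety: worst-case discrimination gap across the subgroup indicator.
group_aucs = [roc_auc_score(y_te[g_te == g], probs[g_te == g]) for g in np.unique(g_te)]
subgroup_gap = max(group_aucs) - min(group_aucs)

print(f"AUC={auc:.3f}  ECE={ece:.3f}  "
      f"AUC(noisy)={auc_under_noise:.3f}  subgroup AUC gap={subgroup_gap:.3f}")
```

In practice, the perturbation family and subgroup definitions would be chosen with clinical input; the point of the sketch is only that each dimension yields a separate, interpretable number rather than being folded into a single accuracy figure.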
Rather
than collapsing these aspects into a single opaque score, we present them as a
transparent trustworthiness scorecard that supports human interpretation and
risk-aware decision-making. We are currently applying this evaluation framework
to clinical risk prediction tasks using tabular healthcare datasets. Baseline
models, including logistic regression, gradient-boosted trees, and neural
networks, are evaluated under identical conditions. Preliminary results suggest
that models with similar AUC values can differ substantially in calibration
quality, robustness to perturbations, and subgroup behavior. These differences,
while often invisible to standard metrics, are critical from a trust and safety
perspective.
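One possible way to assemble such a scorecard, under the same illustrative assumptions as the sketch above (synthetic data, assumed metric choices, hypothetical helper names such as `trust_row`), is to report one row per baseline model and one transparent column per trust dimension, with no aggregation into a single score:

```python
# Hedged sketch of a trustworthiness scorecard: one row per baseline model,
# one column per trust dimension, no single aggregated score.
# Metric choices, noise level, and subgroup definition are assumptions.
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier


def ece(y, p, n_bins=10):
    # Binned expected calibration error, as in the previous sketch.
    idx = np.clip(np.digitize(p, np.linspace(0, 1, n_bins + 1)) - 1, 0, n_bins - 1)
    return sum((idx == b).mean() * abs(y[idx == b].mean() - p[idx == b].mean())
               for b in range(n_bins) if (idx == b).any())


def trust_row(model, X_te, y_te, g_te, rng):
    p = model.predict_proba(X_te)[:, 1]
    p_noisy = model.predict_proba(X_te + rng.normal(0, 0.5, X_te.shape))[:, 1]
    g_aucs = [roc_auc_score(y_te[g_te == g], p[g_te == g]) for g in np.unique(g_te)]
    return {
        "auc": roc_auc_score(y_te, p),                    # predictive performance
        "ece": ece(y_te, p),                              # calibration
        "auc_under_noise": roc_auc_score(y_te, p_noisy),  # robustness probe
        "subgroup_auc_gap": max(g_aucs) - min(g_aucs),    # subgroup safety
    }


# Identical synthetic conditions for every baseline, mirroring the text.
X, y = make_classification(n_samples=4000, n_features=20, random_state=0)
groups = (X[:, 0] > 0).astype(int)  # hypothetical subgroup indicator
X_tr, X_te, y_tr, y_te, _, g_te = train_test_split(X, y, groups, random_state=0)
rng = np.random.default_rng(0)

baselines = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "gradient_boosted_trees": GradientBoostingClassifier(random_state=0),
    "neural_network": MLPClassifier(max_iter=500, random_state=0),
}
scorecard = pd.DataFrame(
    {name: trust_row(m.fit(X_tr, y_tr), X_te, y_te, g_te, rng)
     for name, m in baselines.items()}
).T
print(scorecard.round(3))
```

Two models with similar values in the `auc` column can then be inspected side by side on the calibration, robustness, and subgroup columns, keeping the trade-offs visible to the human reader rather than hidden behind an aggregate number.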
Finally, by shifting the evaluation focus from “How accurate is the model?” to “How
trustworthy is the model for human decision-makers?”, this work aims to bridge
the gap between technical performance and clinical adoption. Future work will
extend robustness analyses, refine subgroup safety assessments, and investigate
how human-centered evaluation signals influence clinician confidence and
reliance on AI systems.