SCIENTIFIC REPORTS, pp. 1-26, 2025 (SCI-Expanded, Scopus)
This study introduces a hybrid deep learning model that integrates EfficientNet-B3 and a Vision Transformer through an Attention Fusion mechanism for automated colon cancer detection on the Kvasir endoscopic dataset. The model leverages EfficientNet-B3’s strength in capturing fine-grained local textures and the Vision Transformer’s ability to model global contextual relationships. A multi-head attention-based fusion block harmonizes these complementary features, yielding richer joint representations and more stable classification. Model optimization was guided by the Matthews Correlation Coefficient (MCC), alongside evaluation with accuracy, F1-score, and Brier score. Experimental results demonstrate 96.2% accuracy and an MCC of 0.961, surpassing standalone baselines and existing benchmark architectures. Cross-validation confirmed robust generalization, while Grad-CAM analyses improved interpretability by visualizing the salient histopathological regions that influence predictions. Despite slight overfitting tendencies, the model maintained strong performance across all eight image classes. These findings show that combining local and global feature extraction addresses key limitations of single-architecture approaches, offering rapid, objective, and reliable diagnostic support. The proposed framework shows clear promise for integration into computer-aided colonoscopy systems, supporting enhanced clinical diagnostics and a reduced pathologist workload through AI-driven precision medicine.
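The abstract describes, but does not specify in code, how the two backbones are fused; the sketch below is one plausible PyTorch realization under stated assumptions. The timm backbone names (efficientnet_b3, vit_base_patch16_224), the shared embedding size, the number of attention heads, and the one-token-per-branch fusion layout are illustrative choices, not the authors' reported configuration; only the overall pattern (a CNN branch for local texture, a ViT branch for global context, multi-head attention fusion, an eight-class Kvasir head) follows the description above.

```python
import torch
import torch.nn as nn
import timm  # assumed dependency providing the EfficientNet-B3 and ViT backbones


class AttentionFusionClassifier(nn.Module):
    """Illustrative hybrid of a CNN branch (local texture) and a ViT branch
    (global context), fused with multi-head attention, for 8 Kvasir classes."""

    def __init__(self, num_classes: int = 8, embed_dim: int = 512,
                 num_heads: int = 8, pretrained: bool = True):
        super().__init__()
        # Branch 1: EfficientNet-B3 features (fine-grained local textures).
        self.cnn = timm.create_model("efficientnet_b3", pretrained=pretrained,
                                     num_classes=0)
        # Branch 2: Vision Transformer features (global contextual relations).
        self.vit = timm.create_model("vit_base_patch16_224",
                                     pretrained=pretrained, num_classes=0)
        # Project both branches into a shared embedding space.
        self.cnn_proj = nn.Linear(self.cnn.num_features, embed_dim)
        self.vit_proj = nn.Linear(self.vit.num_features, embed_dim)
        # Multi-head attention fusion over the two branch tokens.
        self.fusion = nn.MultiheadAttention(embed_dim, num_heads,
                                            batch_first=True)
        self.norm = nn.LayerNorm(embed_dim)
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Each backbone yields one pooled feature vector per image.
        cnn_tok = self.cnn_proj(self.cnn(x)).unsqueeze(1)  # (B, 1, D)
        vit_tok = self.vit_proj(self.vit(x)).unsqueeze(1)  # (B, 1, D)
        tokens = torch.cat([cnn_tok, vit_tok], dim=1)      # (B, 2, D)
        # Attention lets each branch reweight the other's contribution.
        fused, _ = self.fusion(tokens, tokens, tokens)
        fused = self.norm(fused + tokens).mean(dim=1)      # residual + pool
        return self.head(fused)                            # (B, num_classes)


if __name__ == "__main__":
    model = AttentionFusionClassifier(pretrained=False)  # skip weight download
    logits = model(torch.randn(2, 3, 224, 224))          # toy 224x224 RGB batch
    print(logits.shape)                                   # torch.Size([2, 8])
```

In this sketch, attention over the two branch tokens lets each backbone reweight the other's contribution before pooling and classification; MCC-guided model selection as reported above could then be applied to validation predictions, for example with sklearn.metrics.matthews_corrcoef.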