2025 9th International Symposium on Multidisciplinary Studies and Innovative Technologies (ISMSIT), Ankara, Türkiye, 14-16 November 2025, pp. 1-6 (Full-Text Paper)
Malicious PDF files pose a significant threat to digital security: they can compromise sensitive information and disrupt system operations. Detecting and classifying such files is therefore critical to maintaining cybersecurity. This study introduces a systematic approach to categorizing PDF documents. From a dataset of roughly 30,000 PDF files, 43 structural and general features were identified. The data were converted into numerical representations using TF-IDF, N-gram, and Word2Vec techniques and then analyzed with a variety of machine learning and deep learning models. The machine learning models performed well, with the Support Vector Machine achieving 0.9967 accuracy using TF-IDF and the Decision Tree reaching 0.9966 accuracy with Count Vectorizer and N-gram. Among the deep learning models, CNN achieved up to 0.9967 accuracy with Count Vectorizer and N-gram, while BiLSTM and GRU also demonstrated high performance using Word2Vec. These results indicate that the proposed approach provides reliable and effective detection of PDF-based malware.
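To make the vectorization step concrete, the following is a minimal sketch of count vectorization and TF-IDF weighting over word N-grams, written in plain Python. It is an illustration only, not the study's implementation: the function names, the toy token lists (a few PDF structural keywords), and the particular smoothed IDF formula are all assumptions, and the 43 features used in the study are not reproduced here.

```python
import math
from collections import Counter


def ngrams(tokens, n_max=2):
    """Return all contiguous word n-grams of length 1..n_max as strings."""
    out = []
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            out.append(" ".join(tokens[i:i + n]))
    return out


def count_vectors(docs, n_max=2):
    """Count vectorization: one mapping of n-gram -> count per document."""
    return [Counter(ngrams(doc, n_max)) for doc in docs]


def tfidf_vectors(docs, n_max=2):
    """TF-IDF over n-grams, with each document vector L2-normalized.

    Assumption: a smoothed IDF, ln((1 + N) / (1 + df)) + 1, which is one
    common variant; the study does not state which formula it used.
    """
    counts = count_vectors(docs, n_max)
    n_docs = len(counts)

    # Document frequency: in how many documents each n-gram occurs.
    df = Counter()
    for c in counts:
        df.update(c.keys())

    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1.0 for t, d in df.items()}

    vectors = []
    for c in counts:
        weights = {t: tf * idf[t] for t, tf in c.items()}
        # Guard against an empty document, whose norm would be zero.
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({t: w / norm for t, w in weights.items()})
    return vectors


# Hypothetical toy "documents": token sequences standing in for PDF structure.
toy_docs = [
    ["obj", "endobj", "stream", "endstream"],
    ["obj", "JS", "OpenAction", "endobj"],
]
toy_vectors = tfidf_vectors(toy_docs)
```

In this sketch, an N-gram that occurs in only one document (such as `JS`) receives a higher weight than one shared by both (such as `obj`). The resulting sparse vectors could then be passed to a classifier such as an SVM or a decision tree.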