Abstract
Malware continues to pose a significant threat to individuals and businesses of all sizes. Malware analysts seeking an automated detection solution usually turn to machine learning: they train a model on features extracted statically and dynamically from portable executable (PE) files and then use that model for malware detection. In this project we created a dataset comprising 34,055 samples, using a low-level tool called "PExtract". The dataset is intended for malware detection and contains DOS Header information, Optional Header information, Section Names, DLLs and API calls. Using the string features from the dataset, we trained and evaluated 15 models, including Transformers, Naive Bayes, and a neural network model taken from the state of the art. We reproduced API-MalDetect as a representative of the state of the art in order to compare it with our proposal. In our evaluation, ModernBERT was the best option for static malware detection on string-type features, reaching 98.7% accuracy on our test set. API-MalDetect reached a peak accuracy of 91.55% when trained on Section Names. Conversely, Naive Bayes models consistently underperformed and are therefore not recommended when working with low-level static features. Our contributions include the PExtract tool and dataset, as well as the fine-tuned Transformer model, ModernBERT, which outperforms current state-of-the-art methods.
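To make the kind of low-level static feature named in the abstract more concrete, the following is a minimal sketch of reading Section Names from the raw bytes of a PE file, using only the Python standard library. It is not the PExtract implementation, which the abstract does not describe; the function names `read_section_names` and `build_toy_pe` are hypothetical, and the toy buffer is a synthetic stand-in for a real executable. The offsets used (the `MZ` magic at offset 0, `e_lfanew` at 0x3C, the 20-byte COFF header, and 40-byte section table entries) follow the published PE format.

```python
import struct


def read_section_names(data: bytes) -> list[str]:
    """Return the section names found in the raw bytes of a PE file."""
    # DOS header: "MZ" magic at offset 0, e_lfanew (offset of the
    # PE signature) stored as a little-endian uint32 at offset 0x3C.
    if len(data) < 0x40 or data[:2] != b"MZ":
        raise ValueError("not a PE file: missing MZ magic")
    (e_lfanew,) = struct.unpack_from("<I", data, 0x3C)
    if data[e_lfanew:e_lfanew + 4] != b"PE\x00\x00":
        raise ValueError("not a PE file: missing PE signature")

    # The 20-byte COFF file header follows the 4-byte signature.
    coff = e_lfanew + 4
    (_machine, n_sections, _timestamp, _sym_ptr, _n_syms,
     opt_size, _characteristics) = struct.unpack_from("<HHIIIHH", data, coff)

    # The section table starts right after the optional header.
    # Each entry is 40 bytes; the first 8 bytes hold the NUL-padded name.
    table = coff + 20 + opt_size
    names = []
    for i in range(n_sections):
        raw = data[table + 40 * i: table + 40 * i + 8]
        names.append(raw.rstrip(b"\x00").decode("latin-1"))
    return names


def build_toy_pe(section_names: list[str]) -> bytes:
    """Build a synthetic, non-executable buffer with valid PE headers.

    Hypothetical helper for illustration only: it has no optional
    header and no section data, just enough structure to parse.
    """
    dos = bytearray(64)
    dos[:2] = b"MZ"
    struct.pack_into("<I", dos, 0x3C, 64)  # PE signature follows the DOS header
    coff = struct.pack("<HHIIIHH", 0x8664, len(section_names), 0, 0, 0, 0, 0)
    table = b"".join(n.encode("ascii").ljust(40, b"\x00") for n in section_names)
    return bytes(dos) + b"PE\x00\x00" + coff + table


if __name__ == "__main__":
    sample = build_toy_pe([".text", ".rdata", ".data"])
    print(read_section_names(sample))
```

On a real sample, the returned list could be joined into a single string before tokenization; that step is an assumption about how string features might be prepared, not something the abstract specifies.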
| Original language | English |
|---|---|
| Title of host publication | 2025 9th Cyber Security in Networking Conference (CSNet) |
| Publisher | IEEE |
| ISBN (Electronic) | 9798331575564 |
| ISBN (Print) | 9798331575571 |
| Publication status | Published - 16 Dec 2025 |
| Event | 9th Cyber Security in Networking Conference 2025 - Abu Dhabi, United Arab Emirates; Duration: 20 Oct 2025 → 22 Oct 2025 |
Conference
| Conference | 9th Cyber Security in Networking Conference 2025 |
|---|---|
| Abbreviated title | CSNet 2025 |
| Country/Territory | United Arab Emirates |
| City | Abu Dhabi |
| Period | 20/10/25 → 22/10/25 |
Keywords
- Accuracy
- Neural networks
- Feature extraction
- Transformers
- Probabilistic logic
- Malware
- Tokenization
- Bayes methods
- Proposals
- Random forests
- malware detection
- transformers
- malware analysis
- malware dataset
- neural networks
- deep learning