Evaluating Word Embeddings for Malware Detection: A Comparative Study of BoW, Word2Vec, and BERT

Supervisor(s):	Paul-Andrei Sava, Chingyu Kao
Status:	finished
Topic:	Others
Author:	Cameron Hirschkorn
Submission:	2025-10-01
Type of Thesis:	Bachelor's thesis
Thesis topic in co-operation with the Fraunhofer Institute for Applied and Integrated Security AISEC, Garching
Description:
This research investigates how effective modern natural language processing (NLP) embedding methods are, compared to traditional approaches, at processing textual sandbox reports for models that classify malicious Windows PE files. As signature-based detection becomes less effective against evolving malware, more sophisticated ways of analyzing textual descriptions of malware characteristics are needed to extract useful information for model creation. This study evaluates whether advanced word representation techniques can improve classification accuracy on such text input. The work is designed to be replicable and highly comparable: it uses publicly available datasets and reports the model parameters used.

The research compares three approaches: traditional bag-of-words (BoW), Word2Vec, and BERT-based models. Using the BODMAS dataset (135,000 labeled samples) for training and APTracker (60,000 external malicious samples) for external validity testing, we evaluate model performance with the F1-score, phi coefficient, precision, recall, and AUC measures.

Results demonstrate that advanced embedding methods achieve significant improvements on challenging malware samples, with BERT showing a 13.3% relative F1 improvement and a 30.8% relative recall improvement over bag-of-words on the external validation set. However, the traditional method retains an advantage in interpretability, as evidenced by the transparent decision tree visualization. The study shows that model choice depends on operational requirements: embedding methods generalize better to novel threats and achieve higher peak performance, while traditional approaches offer greater interpretability with competitive precision.

Cohen's kappa scores of around 0.6 between the traditional and modern models indicate that the methods rely on different textual patterns, suggesting that the approaches are complementary rather than redundant. These findings contribute to cybersecurity research by demonstrating the potential performance gains of word embedding models while highlighting the greater interpretability of traditional methods.
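The abstract names several quantities without defining them: the F1-score, the phi coefficient, Cohen's kappa as an agreement measure between two models, and relative improvement. The following is a minimal sketch of how these are commonly computed for binary labels, not the thesis's own evaluation code. All function names, variable names, and the eight toy labels are hypothetical and chosen only for illustration; they are not results from BODMAS or APTracker.

```python
from math import sqrt


def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels, where 1 = malicious."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def f1_score(y_true, y_pred):
    """Harmonic mean of precision and recall: 2TP / (2TP + FP + FN)."""
    tp, fp, fn, _ = confusion_counts(y_true, y_pred)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def phi_coefficient(y_true, y_pred):
    """Phi coefficient (Matthews correlation) from the confusion matrix."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def cohens_kappa(pred_a, pred_b):
    """Chance-corrected agreement between two models' binary predictions."""
    n = len(pred_a)
    observed = sum(1 for a, b in zip(pred_a, pred_b) if a == b) / n
    rate_a = sum(pred_a) / n  # share of samples model A flags as malicious
    rate_b = sum(pred_b) / n  # share of samples model B flags as malicious
    expected = rate_a * rate_b + (1 - rate_a) * (1 - rate_b)
    return (observed - expected) / (1 - expected) if expected != 1 else 1.0


def relative_improvement(baseline, candidate):
    """Relative change of a metric with respect to a baseline value."""
    return (candidate - baseline) / baseline


# Hypothetical toy labels and predictions (illustration only).
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
bow_pred = [1, 1, 0, 0, 0, 0, 0, 1]
bert_pred = [1, 1, 1, 0, 0, 0, 1, 0]

f1_bow = f1_score(y_true, bow_pred)
f1_bert = f1_score(y_true, bert_pred)
print("F1 (BoW, BERT):", f1_bow, f1_bert)
print("phi (BERT):", phi_coefficient(y_true, bert_pred))
print("kappa (BoW vs BERT):", cohens_kappa(bow_pred, bert_pred))
print("relative F1 improvement:", relative_improvement(f1_bow, f1_bert))
```

Note that the kappa here is computed between the two models' predictions rather than against the ground truth, which matches its use in the abstract as a measure of how similarly the models decide. A relative improvement is a ratio to the baseline value, so a "13.3% relative F1 improvement" is not the same as a gain of 13.3 percentage points.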
