Description
		This research investigates the effectiveness of modern natural language processing
(NLP) embedding methods compared to traditional approaches in processing textual
sandbox reports when creating models to classify malicious Windows PE files. As
signature-based detection methods become less effective against evolving malware,
more sophisticated methods of analyzing textual malware characteristics are needed
to extract useful information for model creation. This study evaluates
whether advanced word representation techniques can improve classification accuracy
based on text input of malware characteristics. We carry out the evaluation in a
replicable and comparable way, using publicly available datasets and reporting
the model parameters used. The research compares three approaches: traditional
bag-of-words, Word2vec, and BERT-based models. Using the BODMAS dataset (135 000
labeled samples) for training and APTracker (60 000 external malicious samples)
for external validity testing, we evaluate model performance across metrics including
the F1-score, phi coefficient, precision, recall, and AUC measures. Results demonstrate
that advanced embedding methods achieve significant improvements on challenging
malware samples, with BERT showing a 13.3% relative F1 improvement and 30.8%
relative recall improvement over bag-of-words on the external validation set. However,
the traditional method maintains advantages in interpretability, as evidenced by the
transparent decision tree visualization. The study reveals that model choice depends
on operational requirements: embedding methods generalize better to novel threats
and achieve higher peak performance, while traditional approaches offer greater
interpretability with competitive precision. Cohen's kappa scores of around 0.6
between the traditional and modern models indicate that the methods consider
different textual patterns, suggesting that they are complementary rather than
redundant. These findings contribute to cybersecurity research
by demonstrating the potential performance gains of word embedding models while
showcasing the greater interpretability of traditional methods. 
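
The metrics named in this description (F1-score, phi coefficient, relative improvement, and Cohen's kappa between two models) can be illustrated with a minimal Python sketch. This is not the study's code: the function names are hypothetical, the label vectors are toy data chosen only to make the arithmetic easy to follow, and 1 is assumed to mean "malicious".

```python
from math import sqrt


def confusion(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels, assuming 1 = malicious."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


def f1_score(y_true, y_pred):
    """Harmonic mean of precision and recall."""
    tp, fp, fn, _ = confusion(y_true, y_pred)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def phi_coefficient(y_true, y_pred):
    """Phi coefficient (Matthews correlation) for a 2x2 confusion matrix."""
    tp, fp, fn, tn = confusion(y_true, y_pred)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def cohen_kappa(pred_a, pred_b):
    """Chance-corrected agreement between two models' binary predictions."""
    n = len(pred_a)
    observed = sum(1 for a, b in zip(pred_a, pred_b) if a == b) / n
    pos_a, pos_b = sum(pred_a) / n, sum(pred_b) / n
    expected = pos_a * pos_b + (1 - pos_a) * (1 - pos_b)
    return (observed - expected) / (1 - expected) if expected != 1 else 1.0


def relative_improvement(new, baseline):
    """Relative gain of `new` over `baseline`, e.g. 0.133 means 13.3%."""
    return (new - baseline) / baseline


# Toy data only; these are NOT results from the study.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
baseline_pred = [1, 1, 0, 0, 0, 0, 0, 1]   # stand-in for a bag-of-words model
embedding_pred = [1, 1, 1, 0, 0, 0, 1, 0]  # stand-in for an embedding model

f1_base = f1_score(y_true, baseline_pred)
f1_emb = f1_score(y_true, embedding_pred)
print("relative F1 improvement:", relative_improvement(f1_emb, f1_base))
print("phi (embedding model):", phi_coefficient(y_true, embedding_pred))
print("kappa between models:", cohen_kappa(baseline_pred, embedding_pred))
```

A kappa well below 1 between two models of similar accuracy means they err on different samples, which is the basis for treating them as complementary. In practice, library implementations such as those in scikit-learn would normally be used instead of hand-written functions.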