TY - GEN
T1 - Identifying and Categorizing Malicious Content on Paste Sites
T2 - 19th Annual IEEE International Conference on Intelligence and Security Informatics, ISI 2021
AU - Vahedi, Tala
AU - Ampel, Benjamin
AU - Samtani, Sagar
AU - Chen, Hsinchun
N1 - Funding Information:
This work was supported in part by the National Science Foundation under grant numbers DGE-1921485 (SFS), OAC-1917117 (CICI), and CNS-1850362 (SaTC CRII).
Publisher Copyright:
© 2021 IEEE.
PY - 2021
Y1 - 2021
N2 - Malicious cyber activities impose substantial costs on the U.S. economy and global markets. Cyber-criminals often use information-sharing social media platforms such as paste sites (e.g., Pastebin) to share vast amounts of plain text content related to Personally Identifiable Information (PII), credit card numbers, exploit code, malware, and other sensitive content. Paste sites can provide targeted Cyber Threat Intelligence (CTI) about potential threats and prior breaches. In this research, we propose a novel Bidirectional Encoder Representations from Transformers (BERT) with Latent Dirichlet Allocation (LDA) model to categorize pastes automatically. Our proposed BERT-LDA model leverages a neural network transformer architecture to capture sequential dependencies when representing each sentence in a paste. BERT-LDA replaces the Bag-of-Words (BoW) approach in conventional LDA with a Bag-of-Labels (BoL) that encompasses class labels at the sequence level. We compared the performance of the proposed BERT-LDA against conventional LDA and BERT-LDA variants (e.g., GPT2-LDA) on 4,254,453 pastes from three paste sites. Experimental results indicate that the proposed BERT-LDA outperformed standard LDA and each BERT-LDA variant in terms of perplexity on each paste site. Results of our BERT-LDA case study suggest that significant content relating to hacker community activities, malicious code, network and website vulnerabilities, and PII is shared on paste sites. The insights provided by this study could be used by organizations to proactively mitigate potential damage to their infrastructure.
AB - Malicious cyber activities impose substantial costs on the U.S. economy and global markets. Cyber-criminals often use information-sharing social media platforms such as paste sites (e.g., Pastebin) to share vast amounts of plain text content related to Personally Identifiable Information (PII), credit card numbers, exploit code, malware, and other sensitive content. Paste sites can provide targeted Cyber Threat Intelligence (CTI) about potential threats and prior breaches. In this research, we propose a novel Bidirectional Encoder Representations from Transformers (BERT) with Latent Dirichlet Allocation (LDA) model to categorize pastes automatically. Our proposed BERT-LDA model leverages a neural network transformer architecture to capture sequential dependencies when representing each sentence in a paste. BERT-LDA replaces the Bag-of-Words (BoW) approach in conventional LDA with a Bag-of-Labels (BoL) that encompasses class labels at the sequence level. We compared the performance of the proposed BERT-LDA against conventional LDA and BERT-LDA variants (e.g., GPT2-LDA) on 4,254,453 pastes from three paste sites. Experimental results indicate that the proposed BERT-LDA outperformed standard LDA and each BERT-LDA variant in terms of perplexity on each paste site. Results of our BERT-LDA case study suggest that significant content relating to hacker community activities, malicious code, network and website vulnerabilities, and PII is shared on paste sites. The insights provided by this study could be used by organizations to proactively mitigate potential damage to their infrastructure.
KW - BERT
KW - Paste sites
KW - Pastebin
KW - cyber threat intelligence
KW - exploit code
KW - malicious content
KW - topic modeling
KW - transformers
UR - http://www.scopus.com/inward/record.url?scp=85123488969&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85123488969&partnerID=8YFLogxK
U2 - 10.1109/ISI53945.2021.9624765
DO - 10.1109/ISI53945.2021.9624765
M3 - Conference contribution
AN - SCOPUS:85123488969
T3 - Proceedings - 2021 IEEE International Conference on Intelligence and Security Informatics, ISI 2021
BT - Proceedings - 2021 IEEE International Conference on Intelligence and Security Informatics, ISI 2021
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 2 November 2021 through 3 November 2021
ER -