Identifying and Categorizing Malicious Content on Paste Sites: A Neural Topic Modeling Approach

Tala Vahedi, Benjamin Ampel, Sagar Samtani, Hsinchun Chen

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

7 Scopus citations

Abstract

Malicious cyber activities impose substantial costs on the U.S. economy and global markets. Cyber-criminals often use information-sharing social media platforms such as paste sites (e.g., Pastebin) to share vast amounts of plain-text content containing Personally Identifiable Information (PII), credit card numbers, exploit code, malware, and other sensitive material. Paste sites can therefore provide targeted Cyber Threat Intelligence (CTI) about potential threats and prior breaches. In this research, we propose a novel model that combines Bidirectional Encoder Representations from Transformers (BERT) with Latent Dirichlet Allocation (LDA) to categorize pastes automatically. The proposed BERT-LDA model leverages a neural network transformer architecture to capture sequential dependencies when representing each sentence in a paste. BERT-LDA replaces the Bag-of-Words (BoW) input of conventional LDA with a Bag-of-Labels (BoL) that encompasses class labels at the sequence level. We compared the performance of the proposed BERT-LDA against conventional LDA and BERT-LDA variants (e.g., GPT2-LDA) on 4,254,453 pastes from three paste sites. Experimental results indicate that the proposed BERT-LDA outperformed standard LDA and each BERT-LDA variant in terms of perplexity on each paste site. Results of our BERT-LDA case study suggest that a substantial amount of content relating to hacker community activities, malicious code, network and website vulnerabilities, and PII is shared on paste sites. Organizations could use the insights provided by this study to proactively mitigate potential damage to their infrastructure.
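The Bag-of-Labels idea and the perplexity metric mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the keyword lookup in `label_sentence` is a hypothetical stand-in for a BERT sentence classifier, the label names are invented for illustration, and `perplexity` simply applies the standard topic-model definition (the exponential of the negative average log-likelihood per token) to a bag of labels.

```python
import math
from collections import Counter

# Hypothetical keyword-to-label rules. In the approach described above this
# role is played by a transformer-based sentence classifier; a lookup table
# is used here only to keep the sketch self-contained.
KEYWORD_LABELS = {
    "password": "credentials",
    "ssn": "pii",
    "exploit": "exploit_code",
}


def label_sentence(sentence: str) -> str:
    """Assign one class label to a sentence (toy keyword lookup)."""
    lowered = sentence.lower()
    for keyword, label in KEYWORD_LABELS.items():
        if keyword in lowered:
            return label
    return "other"


def bag_of_labels(paste: str) -> Counter:
    """Represent a paste by counts of sentence-level labels instead of words."""
    sentences = [line for line in paste.splitlines() if line.strip()]
    return Counter(label_sentence(s) for s in sentences)


def perplexity(bag, doc_topics, topic_label_probs):
    """Perplexity of one document: exp(-log-likelihood / number of tokens).

    doc_topics is the document's topic mixture; topic_label_probs[k] maps
    each label to its probability under topic k. Both are assumed inputs,
    e.g. taken from an already fitted topic model.
    """
    log_likelihood = 0.0
    total = 0
    for label, count in bag.items():
        # Mixture probability of this label; a tiny floor avoids log(0).
        p = sum(
            theta * topic_label_probs[k].get(label, 1e-12)
            for k, theta in enumerate(doc_topics)
        )
        log_likelihood += count * math.log(p)
        total += count
    return math.exp(-log_likelihood / total)


paste = "admin password: hunter2\nsample exploit snippet\nhello world"
bag = bag_of_labels(paste)

# Two toy topics that are both uniform over four labels.
labels = ["credentials", "pii", "exploit_code", "other"]
uniform_topic = {label: 0.25 for label in labels}
score = perplexity(bag, [0.5, 0.5], [uniform_topic, uniform_topic])
```

Lower perplexity means the model assigns higher probability to held-out documents, which is the sense in which the abstract reports BERT-LDA outperforming the baselines. With uniform topics over four labels, as in this toy setup, the score equals the number of labels.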

Original language: English (US)
Title of host publication: Proceedings - 2021 IEEE International Conference on Intelligence and Security Informatics, ISI 2021
Publisher: Institute of Electrical and Electronics Engineers Inc.
ISBN (Electronic): 9781665438384
State: Published - 2021
Event: 19th Annual IEEE International Conference on Intelligence and Security Informatics, ISI 2021 - Virtual, Online, United States
Duration: Nov 2 2021 to Nov 3 2021

Publication series

Name: Proceedings - 2021 IEEE International Conference on Intelligence and Security Informatics, ISI 2021

Conference

Conference: 19th Annual IEEE International Conference on Intelligence and Security Informatics, ISI 2021
Country/Territory: United States
City: Virtual, Online
Period: 11/2/21 to 11/3/21

Keywords

  • BERT
  • Paste sites
  • Pastebin
  • cyber threat intelligence
  • exploit code
  • malicious content
  • topic modeling
  • transformers

ASJC Scopus subject areas

  • Artificial Intelligence
  • Computer Networks and Communications
  • Information Systems
  • Information Systems and Management
  • Safety, Risk, Reliability and Quality
