TY - GEN
T1 - Distilling Contextual Embeddings into A Static Word Embedding for Improving Hacker Forum Analytics
AU - Ampel, Benjamin
AU - Chen, Hsinchun
N1 - Funding Information:
This work was supported in part by the National Science Foundation under grant numbers DGE-1921485 (SFS), OAC-1917117 (CICI), and CNS-1850362 (SaTC CRII).
Publisher Copyright:
© 2021 IEEE.
PY - 2021
Y1 - 2021
AB - Hacker forums provide malicious actors with a large database of tutorials, goods, and assets to leverage for cyber-attacks. Careful research of these forums can provide tremendous benefit to the cybersecurity community through trend identification and exploit categorization. This study aims to provide a novel static word embedding, Hack2Vec, to improve performance on hacker forum classification tasks. Our proposed Hack2Vec model distills contextual representations from the seminal pre-trained language model BERT to a continuous bag-of-words model to create a highly targeted hacker forum static word embedding. The results of our experimental design indicate that Hack2Vec improves performance over prominent embeddings in accuracy, precision, recall, and F1-score for a benchmark hacker forum classification task.
KW - Hacker forums
KW - contextual embeddings
KW - knowledge distillation
KW - static word embeddings
KW - text classification
UR - http://www.scopus.com/inward/record.url?scp=85123469448&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85123469448&partnerID=8YFLogxK
U2 - 10.1109/ISI53945.2021.9624848
DO - 10.1109/ISI53945.2021.9624848
M3 - Conference contribution
AN - SCOPUS:85123469448
T3 - Proceedings - 2021 IEEE International Conference on Intelligence and Security Informatics, ISI 2021
BT - Proceedings - 2021 IEEE International Conference on Intelligence and Security Informatics, ISI 2021
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 19th Annual IEEE International Conference on Intelligence and Security Informatics, ISI 2021
Y2 - 2 November 2021 through 3 November 2021
ER -