TY - GEN
T1 - A Generative Adversarial Learning Framework for Breaking Text-Based CAPTCHA in the Dark Web
AU - Zhang, Ning
AU - Ebrahimi, Mohammadreza
AU - Li, Weifeng
AU - Chen, Hsinchun
N1 - Funding Information:
Acknowledgments: This material is based upon work supported by the National Science Foundation (NSF) under Secure and Trustworthy Cyberspace (grant No. 1936370), Cybersecurity Innovation for Cyberinfrastructure (grant No. 1917117), and Cybersecurity Scholarship-for-Service (grant No. 1921485).
Publisher Copyright:
© 2020 IEEE.
PY - 2020/11/9
Y1 - 2020/11/9
AB - Cyber threat intelligence (CTI) necessitates automated monitoring of dark web platforms (e.g., Dark Net Markets and carding shops) on a large scale. While there are existing methods for collecting data from the surface web, large-scale dark web data collection is commonly hindered by anti-crawling measures. Text-based CAPTCHA, which requires the user to recognize a combination of hard-to-read characters, is the most prohibitive of these measures. Dark web CAPTCHA patterns are intentionally designed with additional background noise and variable character length to prevent automated CAPTCHA breaking. Existing CAPTCHA breaking methods cannot remedy these challenges and are therefore not applicable to the dark web. In this study, we propose a novel framework for breaking text-based CAPTCHA in the dark web. The proposed framework utilizes a Generative Adversarial Network (GAN) to counteract dark web-specific background noise and leverages an enhanced character segmentation algorithm. Our proposed method was evaluated on both benchmark and dark web CAPTCHA testbeds. The proposed method significantly outperformed the state-of-the-art baseline methods on all datasets, achieving a success rate of over 92.08% on dark web testbeds. Our research enables the CTI community to develop advanced capabilities for large-scale dark web monitoring.
KW - automated CAPTCHA breaking
KW - cyber threat intelligence
KW - dark web
KW - generative adversarial networks
UR - http://www.scopus.com/inward/record.url?scp=85098937802&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85098937802&partnerID=8YFLogxK
U2 - 10.1109/ISI49825.2020.9280537
DO - 10.1109/ISI49825.2020.9280537
M3 - Conference contribution
AN - SCOPUS:85098937802
T3 - Proceedings - 2020 IEEE International Conference on Intelligence and Security Informatics, ISI 2020
BT - Proceedings - 2020 IEEE International Conference on Intelligence and Security Informatics, ISI 2020
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 18th IEEE International Conference on Intelligence and Security Informatics, ISI 2020
Y2 - 9 November 2020 through 10 November 2020
ER -