TY - GEN
T1 - Linking Personally Identifiable Information from the Dark Web to the Surface Web
T2 - 20th IEEE International Conference on Data Mining Workshops, ICDMW 2020
AU - Lin, Fangyu
AU - Liu, Yizhi
AU - Ebrahimi, Mohammadreza
AU - Ahmad-Post, Zara
AU - Hu, James Lee
AU - Xin, Jingyu
AU - Samtani, Sagar
AU - Li, Weifeng
AU - Chen, Hsinchun
N1 - Funding Information:
This material is based upon work supported by the National Science Foundation (NSF) under Secure and Trustworthy Cyberspace (grant No. 1936370), Cybersecurity Innovation for Cyber Infrastructure (grant No. 1917117), and CyberCorps Scholarship-for-Service (grant No. 1921485).
Publisher Copyright:
© 2020 IEEE.
PY - 2020/11
Y1 - 2020/11
N2 - The information privacy of Internet users has become a major societal concern. The rapid growth of online services increases the risk of unauthorized access to Personally Identifiable Information (PII) of at-risk populations, who are unaware of their PII exposure. To proactively identify online at-risk populations and increase their privacy awareness, it is crucial to conduct a holistic privacy risk assessment across the Internet. Current privacy risk assessment studies are limited to a single platform within either the surface web or the dark web. A comprehensive privacy risk assessment requires matching exposed PII on heterogeneous online platforms across the surface web and the dark web. However, due to the incompleteness and inaccuracy of PII records in each platform, linking the exposed PII to users is a non-trivial task. While Entity Resolution (ER) techniques can be used to facilitate this task, they often require ad-hoc, manual rule development and feature engineering. Recently, Deep Learning (DL)-based ER has outperformed manual entity matching rules by automatically extracting prominent features from incomplete or inaccurate records. In this study, we enhance the existing privacy risk assessment with a DL-based ER method, namely Multi-Context Attention (MCA), to comprehensively evaluate individuals' PII exposure across different online platforms in the dark web and the surface web. Evaluation against benchmark ER models indicates the efficacy of MCA. Using MCA on a random sample of data breach victims in the dark web, we are able to identify 4.3% of the victims on the surface web platforms and calculate their privacy risk scores.
KW - Dark web
KW - Data breach
KW - Data collection
KW - PII
KW - Privacy
KW - Surface web
UR - http://www.scopus.com/inward/record.url?scp=85101346894&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85101346894&partnerID=8YFLogxK
U2 - 10.1109/ICDMW51313.2020.00072
DO - 10.1109/ICDMW51313.2020.00072
M3 - Conference contribution
AN - SCOPUS:85101346894
T3 - IEEE International Conference on Data Mining Workshops, ICDMW
SP - 488
EP - 495
BT - Proceedings - 20th IEEE International Conference on Data Mining Workshops, ICDMW 2020
A2 - Di Fatta, Giuseppe
A2 - Sheng, Victor
A2 - Cuzzocrea, Alfredo
A2 - Zaniolo, Carlo
A2 - Wu, Xindong
PB - IEEE Computer Society
Y2 - 17 November 2020 through 20 November 2020
ER -