TY - GEN
T1 - A lexical approach for classifying malicious URLs
AU - Darling, Michael
AU - Heileman, Greg
AU - Gressel, Gilad
AU - Ashok, Aravind
AU - Poornachandran, Prabaharan
N1 - Publisher Copyright: © 2015 IEEE.
PY - 2015/9/2
Y1 - 2015/9/2
N2 - Given the continuous growth of malicious activities on the internet, there is a need for intelligent systems to identify malicious web pages. It has been shown that URL analysis is an effective tool for detecting phishing, malware, and other attacks. Previous studies have performed URL classification using a combination of lexical features, network traffic, hosting information, and other strategies. These approaches require time-intensive lookups, which introduce significant delay in real-time systems. In this paper, we describe a lightweight approach for classifying malicious web pages using URL lexical analysis alone. Our goal is to explore the upper bound of the classification accuracy of a purely lexical approach. We also aim to develop a scalable approach that could be used in a real-time system. We develop a classification system based on lexical analysis of URLs. It correctly classifies URLs of malicious web pages with 99.1% accuracy, a 0.4% false positive rate, and an F1-score of 98.7, at an average classification time of 0.62 milliseconds. Our method also outperforms similar approaches when classifying out-of-sample data.
AB - Given the continuous growth of malicious activities on the internet, there is a need for intelligent systems to identify malicious web pages. It has been shown that URL analysis is an effective tool for detecting phishing, malware, and other attacks. Previous studies have performed URL classification using a combination of lexical features, network traffic, hosting information, and other strategies. These approaches require time-intensive lookups, which introduce significant delay in real-time systems. In this paper, we describe a lightweight approach for classifying malicious web pages using URL lexical analysis alone. Our goal is to explore the upper bound of the classification accuracy of a purely lexical approach. We also aim to develop a scalable approach that could be used in a real-time system. We develop a classification system based on lexical analysis of URLs. It correctly classifies URLs of malicious web pages with 99.1% accuracy, a 0.4% false positive rate, and an F1-score of 98.7, at an average classification time of 0.62 milliseconds. Our method also outperforms similar approaches when classifying out-of-sample data.
UR - http://www.scopus.com/inward/record.url?scp=84948444296&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84948444296&partnerID=8YFLogxK
U2 - 10.1109/HPCSim.2015.7237040
DO - 10.1109/HPCSim.2015.7237040
M3 - Conference contribution
AN - SCOPUS:84948444296
T3 - Proceedings of the 2015 International Conference on High Performance Computing and Simulation, HPCS 2015
SP - 195
EP - 202
BT - Proceedings of the 2015 International Conference on High Performance Computing and Simulation, HPCS 2015
A2 - Smari, Waleed W.
A2 - Zeljkovic, Vesna
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 13th International Conference on High Performance Computing and Simulation, HPCS 2015
Y2 - 20 July 2015 through 24 July 2015
ER -