TY - GEN
T1 - Malicious HTML File Prediction
T2 - 15th IEEE/ACS International Conference on Computer Systems and Applications, AICCSA 2018
AU - Hess, Samuel
AU - Satam, Pratik
AU - Ditzler, Gregory
AU - Hariri, Salim
N1 - Funding Information:
This work is partly supported by the AFOSR FA9550-18-1-0427, NSF 1624668 and SES-1314631, and Thomson Reuters in the framework of the Partner University Fund (PUF) project.
Publisher Copyright:
© 2018 IEEE.
PY - 2018/7/2
Y1 - 2018/7/2
AB - Cybersecurity plays a critical role in protecting sensitive information and the structural integrity of networked systems. As networked systems continue to expand in number as well as in complexity, so does the threat of malicious activity and the necessity for advanced cybersecurity solutions. Furthermore, both the quantity and quality of available data on malicious content, as well as the fact that malicious activity continuously evolves, make automated protection systems for this type of environment particularly challenging. Not only is data quality a concern, but the volume of data can be quite small for some of the classes. This creates a class imbalance in the data used to train a classifier; however, many classifiers are not well equipped to deal with class imbalance. One such example is detecting malicious HTML files from static features. Unfortunately, collecting malicious HTML files is extremely difficult, and the data can be quite noisy because HTML files are often mislabeled. This paper evaluates a specific application that is afflicted by these modern cybersecurity challenges: detection of malicious HTML files. Previous work presented a general framework for malicious HTML file classification, which we modify in this work to use a χ2 feature selection technique and the synthetic minority oversampling technique (SMOTE). We experiment with different classifiers (i.e., AdaBoost, GentleBoost, RobustBoost, RUSBoost, and Random Forest) and a pure detection model (i.e., Isolation Forest). We benchmark the different classifiers using SMOTE on a real dataset that contains a limited number of malicious files (40) relative to the normal files (7,263). The modified framework performed better than the previous framework. However, additional evidence implies that algorithms which train on both the normal and malicious samples are likely overtraining to the malicious distribution. We demonstrate the likely overtraining by showing that a subset of the malicious files, while suspicious, did not come from a malicious source.
UR - http://www.scopus.com/inward/record.url?scp=85061931917&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85061931917&partnerID=8YFLogxK
U2 - 10.1109/AICCSA.2018.8612855
DO - 10.1109/AICCSA.2018.8612855
M3 - Conference contribution
AN - SCOPUS:85061931917
T3 - Proceedings of IEEE/ACS International Conference on Computer Systems and Applications, AICCSA
BT - 2018 IEEE/ACS 15th International Conference on Computer Systems and Applications, AICCSA 2018
PB - IEEE Computer Society
Y2 - 28 October 2018 through 1 November 2018
ER -
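
Illustrative sketch (not part of the citation record): the abstract describes combining χ2 feature selection with SMOTE and boosted ensembles on a heavily imbalanced dataset. The code below is a minimal sketch of that kind of pipeline, not the authors' implementation; the libraries (scikit-learn, imbalanced-learn), the synthetic stand-in data, and all hyperparameters are assumptions chosen only for illustration.

# A minimal sketch, assuming scikit-learn and imbalanced-learn are available.
# Synthetic data stands in for the static HTML features, with roughly the
# 40 malicious vs. 7,263 normal class ratio mentioned in the abstract.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline  # applies SMOTE only during fit, not at predict time

# Synthetic stand-in dataset: ~0.55% minority class out of 7,303 samples.
X, y = make_classification(n_samples=7303, n_features=50, n_informative=10,
                           weights=[0.9945], flip_y=0.0, random_state=0)
X = np.abs(X)  # chi2 feature selection requires non-negative feature values

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

pipeline = Pipeline([
    ("select", SelectKBest(chi2, k=20)),           # chi-squared feature selection
    ("smote", SMOTE(random_state=0)),              # oversample the minority (malicious) class
    ("clf", AdaBoostClassifier(n_estimators=200, random_state=0)),  # one of several boosted ensembles
])

pipeline.fit(X_train, y_train)
print(classification_report(y_test, pipeline.predict(X_test), digits=3))

Swapping the final estimator for other ensembles (e.g., a Random Forest) or for a pure detector such as IsolationForest, trained on normal samples only, would mirror the comparison the abstract outlines, though the paper's exact features and settings are not given in this record.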