TY - GEN
T1 - Scaling a Neyman-Pearson subset selection approach via heuristics for mining massive data
AU - Ditzler, Gregory
AU - Austen, Matthew
AU - Rosen, Gail
AU - Polikar, Robi
N1 - Publisher Copyright:
© 2014 IEEE.
PY - 2015/1/13
Y1 - 2015/1/13
N2 - Feature subset selection is an important step towards producing a classifier that relies only on relevant features, while keeping the computational complexity of the classifier low. Feature selection is also used in making inferences on the importance of attributes, even when classification is not the ultimate goal. For example, in bioinformatics and genomics, feature subset selection is used to make inferences about the variables that best describe multiple populations. Unfortunately, many feature selection algorithms require the subset size to be specified a priori, but knowing how many variables to select is typically a nontrivial task. Other approaches rely on a specific variable subset selection framework to be used. In this work, we examine an approach to feature subset selection that works with a generic variable selection algorithm, and our approach provides statistical inference on the number of features that are relevant, which may be unknown to the generic variable selection algorithm. This work extends our previous implementation of a Neyman-Pearson feature selection (NPFS) hypothesis test, which acts as a meta-subset selection algorithm. Specifically, we examine the conservativeness of the NPFS approach by biasing the hypothesis test, and examine other heuristics for NPFS. We include results from carefully designed synthetic datasets. Furthermore, we demonstrate NPFS's ability to perform on data of a massive scale.
AB - Feature subset selection is an important step towards producing a classifier that relies only on relevant features, while keeping the computational complexity of the classifier low. Feature selection is also used in making inferences on the importance of attributes, even when classification is not the ultimate goal. For example, in bioinformatics and genomics, feature subset selection is used to make inferences about the variables that best describe multiple populations. Unfortunately, many feature selection algorithms require the subset size to be specified a priori, but knowing how many variables to select is typically a nontrivial task. Other approaches rely on a specific variable subset selection framework to be used. In this work, we examine an approach to feature subset selection that works with a generic variable selection algorithm, and our approach provides statistical inference on the number of features that are relevant, which may be unknown to the generic variable selection algorithm. This work extends our previous implementation of a Neyman-Pearson feature selection (NPFS) hypothesis test, which acts as a meta-subset selection algorithm. Specifically, we examine the conservativeness of the NPFS approach by biasing the hypothesis test, and examine other heuristics for NPFS. We include results from carefully designed synthetic datasets. Furthermore, we demonstrate NPFS's ability to perform on data of a massive scale.
KW - Neyman-Pearson
KW - feature subset selection
UR - http://www.scopus.com/inward/record.url?scp=84925070563&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84925070563&partnerID=8YFLogxK
U2 - 10.1109/CIDM.2014.7008701
DO - 10.1109/CIDM.2014.7008701
M3 - Conference contribution
AN - SCOPUS:84925070563
T3 - IEEE SSCI 2014 - 2014 IEEE Symposium Series on Computational Intelligence - CIDM 2014: 2014 IEEE Symposium on Computational Intelligence and Data Mining, Proceedings
SP - 439
EP - 445
BT - IEEE SSCI 2014 - 2014 IEEE Symposium Series on Computational Intelligence - CIDM 2014
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 5th IEEE Symposium on Computational Intelligence and Data Mining, CIDM 2014
Y2 - 9 December 2014 through 12 December 2014
ER -