TY - GEN
T1 - Information theoretic feature selection for high dimensional metagenomic data
AU - Ditzler, Gregory
AU - Rosen, Gail
AU - Polikar, Robi
PY - 2012
Y1 - 2012
N2 - Extremely high dimensional data sets are common in genomic classification scenarios, but they are particularly prevalent in metagenomic studies that represent samples as abundances of taxonomic units. Furthermore, the data dimensionality is typically much larger than the number of observations collected for each instance, a phenomenon known as the curse of dimensionality, which is a particularly challenging problem for most machine learning algorithms. The biologists collecting and analyzing data need efficient methods to determine relationships between classes in a data set and the variables that are capable of differentiating between multiple groups in a study. The most common methods of metagenomic data analysis are those characterized by α- and β-diversity tests; however, neither of these tests allows scientists to identify the organisms that are most responsible for differentiating between the categories in a study. In this paper, we present an analysis of information theoretic feature selection methods for improving classification accuracy with metagenomic data.
AB - Extremely high dimensional data sets are common in genomic classification scenarios, but they are particularly prevalent in metagenomic studies that represent samples as abundances of taxonomic units. Furthermore, the data dimensionality is typically much larger than the number of observations collected for each instance, a phenomenon known as the curse of dimensionality, which is a particularly challenging problem for most machine learning algorithms. The biologists collecting and analyzing data need efficient methods to determine relationships between classes in a data set and the variables that are capable of differentiating between multiple groups in a study. The most common methods of metagenomic data analysis are those characterized by α- and β-diversity tests; however, neither of these tests allows scientists to identify the organisms that are most responsible for differentiating between the categories in a study. In this paper, we present an analysis of information theoretic feature selection methods for improving classification accuracy with metagenomic data.
UR - http://www.scopus.com/inward/record.url?scp=84877822554&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84877822554&partnerID=8YFLogxK
U2 - 10.1109/GENSIPS.2012.6507749
DO - 10.1109/GENSIPS.2012.6507749
M3 - Conference contribution
AN - SCOPUS:84877822554
SN - 9781467352369
T3 - Proceedings - IEEE International Workshop on Genomic Signal Processing and Statistics
SP - 143
EP - 146
BT - Proceedings 2012 IEEE International Workshop on Genomic Signal Processing and Statistics, GENSIPS 2012
T2 - 2012 IEEE International Workshop on Genomic Signal Processing and Statistics, GENSIPS 2012
Y2 - 2 December 2012 through 4 December 2012
ER -