Feature selection for high-dimensional imbalanced data

Liuzhi Yin, Yong Ge, Keli Xiao, Xuehua Wang, Xiaojun Quan

Research output: Contribution to journal › Article › peer-review

112 Scopus citations

Abstract

Given its importance, the problem of classification in imbalanced data has attracted great attention in recent years. However, few efforts have been made to develop feature selection techniques for the classification of imbalanced data. This paper fills this critical void by introducing two approaches for the feature selection of high-dimensional imbalanced data. To this end, after introducing three traditional methods, we study and illustrate the challenges of feature selection in imbalanced data with Bayesian learning. We show that the samples in the larger classes have a dominant influence on these feature selection methods, even though the samples in rare classes are essential for the learning performance on rare classes. Based on these observations, we provide a new feature selection approach based on class decomposition. Specifically, we partition the large classes into relatively smaller pseudo-subclasses and generate the pseudo-class labels accordingly. Feature selection is then performed on the decomposed data to compute the goodness measure of each feature. In addition, we introduce a Hellinger distance-based method for feature selection. Hellinger distance is a measure of distribution divergence that is strongly skew-insensitive, because class prior information is not involved in computing the distance. Finally, we show the effectiveness of these two approaches theoretically with Bayesian learning on synthetic data. We also test and compare the performance of the proposed feature selection methods on several real-world data sets. The experimental results show that both the decomposition-based and the Hellinger distance-based methods can outperform existing feature selection methods by a clear margin on imbalanced data.
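The skew insensitivity of the Hellinger distance can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes binary labels (0 for the large class, 1 for the rare class), equal-width binning of a single numeric feature, and one common discrete form of the distance, the square root of the sum over bins of (sqrt(p_j) - sqrt(q_j))^2, where p_j and q_j are the per-class fractions of samples in bin j. The function name `hellinger_score` and the `bins` parameter are hypothetical.

```python
import math


def hellinger_score(values, labels, bins=10):
    """Hellinger distance between the binned class-conditional
    distributions of one feature. Assumes labels are 0 or 1 and
    that both classes have at least one sample."""
    lo, hi = min(values), max(values)
    # A constant feature has zero range; fall back to a dummy width
    # so that every sample lands in bin 0.
    width = (hi - lo) / bins or 1.0
    counts = {0: [0] * bins, 1: [0] * bins}
    for v, y in zip(values, labels):
        j = min(int((v - lo) / width), bins - 1)  # clamp the maximum value
        counts[y][j] += 1
    n0, n1 = sum(counts[0]), sum(counts[1])
    # Each class is normalized by its own size, so the class priors
    # never enter the computation.
    return math.sqrt(sum(
        (math.sqrt(c0 / n0) - math.sqrt(c1 / n1)) ** 2
        for c0, c1 in zip(counts[0], counts[1])
    ))


# Usage: score one feature on balanced data, then again after
# replicating the class-0 samples ten times to make the data imbalanced.
values = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
balanced = hellinger_score(values, labels, bins=2)
skewed = hellinger_score(values[:4] * 10 + values[4:],
                         labels[:4] * 10 + labels[4:], bins=2)
```

Because each class histogram is normalized separately, `balanced` and `skewed` agree in this sketch. Under the same assumptions, features could be ranked by this score, with larger values indicating better separation between the two classes.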

Original language: English (US)
Pages (from-to): 3-11
Number of pages: 9
Journal: Neurocomputing
Volume: 105
DOIs
State: Published - Apr 1 2013
Externally published: Yes

Keywords

  • AUC
  • F-measure
  • Feature selection
  • Hellinger distance
  • Imbalanced data

ASJC Scopus subject areas

  • Computer Science Applications
  • Cognitive Neuroscience
  • Artificial Intelligence
