
Feature selection with RAFS and STIG for knowledge discovery and machine learning model building in binary classification tasks

Speaker(s)
Radosław Piliszek
Affiliation
University of Białystok
Date
2 June 2023, 16:15
Event information
4060 & online: meet.google.com/jbj-tdsr-aop
Seminar
Research seminar "Systemy Inteligentne" (Intelligent Systems)

This presentation reports the results of the PhD thesis "Development of methods for feature selection based on information theory".
Dimensionality reduction is an important step in knowledge discovery and machine learning.
This study focuses on the feature selection branch of dimensionality reduction, since it preserves the original, interpretable features, which is crucial for certain applications (e.g., biomedical).
Furthermore, the goal is to find the smallest (minimal-optimal) set of the most informative features that generalises to the population, in binary classification datasets with tens of thousands of features and high levels of correlation between them.
Additionally, the method is designed not to consult feedback from machine learning models, thus remaining model-neutral.
To this end, a novel feature selection method, Robust Aggregative Feature Selection (RAFS), and a supplementary feature dissimilarity measure, Symmetric Target Information Gain (STIG), which takes the binary decision variable into account, are proposed.
The proposed method utilises cross-validation, all-relevant feature selection filtering, feature clustering, feature ranking and top feature popularity counting.
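For illustration only, the sketch below shows how such a pipeline could be assembled from generic building blocks. The function name, the mutual-information relevance filter (a simple stand-in for a dedicated all-relevant method), the correlation-based clustering, and all parameters are assumptions, not the thesis implementation.

```python
# Hypothetical sketch of a RAFS-like pipeline; names and parameters are
# illustrative assumptions, not the thesis implementation.
from collections import Counter

import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.feature_selection import mutual_info_classif
from sklearn.model_selection import StratifiedKFold


def rafs_like_selection(X, y, n_splits=5, n_clusters=50, top_k=10):
    """Count how often each feature tops its cluster across CV folds."""
    popularity = Counter()
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0)
    for train_idx, _ in cv.split(X, y):
        X_tr, y_tr = X[train_idx], y[train_idx]
        # 1) Relevance filtering: keep features informative about the target
        #    (a simple stand-in for a dedicated all-relevant filter).
        relevance = mutual_info_classif(X_tr, y_tr, random_state=0)
        relevant = np.flatnonzero(relevance > 0)
        if relevant.size < 2:
            continue
        # 2) Cluster correlated features using a correlation-based distance.
        corr = np.corrcoef(X_tr[:, relevant], rowvar=False)
        dist = 1.0 - np.abs(np.nan_to_num(corr))
        Z = linkage(dist[np.triu_indices_from(dist, k=1)], method="average")
        labels = fcluster(Z, t=min(n_clusters, relevant.size),
                          criterion="maxclust")
        # 3) Rank within each cluster and count the top feature's popularity.
        for cluster_id in np.unique(labels):
            members = relevant[labels == cluster_id]
            best = members[np.argmax(relevance[members])]
            popularity[best] += 1
    # 4) Aggregate: return the most popular features across folds.
    return [feature for feature, _ in popularity.most_common(top_k)]
```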
The proposed dissimilarity measure is rooted in information theory.
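The exact definition of STIG is not reproduced here. Purely as an assumed illustration of a symmetric, target-aware, information-theoretic dissimilarity, one could combine the conditional mutual information of each feature with the target given the other, as in the sketch below.

```python
# Assumed illustration only: a symmetric, target-aware dissimilarity built
# from conditional mutual information; not necessarily the actual STIG.
import numpy as np
from sklearn.metrics import mutual_info_score


def conditional_mutual_info(x, y, z):
    """Estimate I(X; Y | Z) for discrete variables from empirical counts."""
    x, y, z = np.asarray(x), np.asarray(y), np.asarray(z)
    cmi = 0.0
    for z_value in np.unique(z):
        mask = z == z_value
        cmi += mask.mean() * mutual_info_score(x[mask], y[mask])
    return cmi


def target_aware_dissimilarity(f1, f2, target):
    """Sum of the unique information each feature carries about the target."""
    return (conditional_mutual_info(f1, target, f2)
            + conditional_mutual_info(f2, target, f1))
```

Under this assumed definition, two features that are mutually redundant with respect to the target receive a dissimilarity near zero, even if their raw values differ.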
The method is applied in external cross-validation on a real-world dataset, using diverse measures of dissimilarity between the features.
The feature selection results are validated using the AUC metric obtained from machine learning models built on the selected features as well as using feature selection stability measures (Jaccard index, Kuncheva index, Consistency Score).
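As an aside, stability measures of this kind can be computed directly from the per-fold selections. The snippet below is a minimal sketch of the pairwise Jaccard index; the fold subsets are made-up examples.

```python
# Minimal sketch: average pairwise Jaccard index over per-fold selections.
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two feature index sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a | b) else 1.0


def mean_pairwise_jaccard(selections):
    """Average Jaccard index over all pairs of per-fold selections."""
    pairs = list(combinations(selections, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Example: feature subsets selected in three CV folds (made-up indices).
folds = [[3, 7, 42], [3, 7, 19], [7, 42, 19]]
print(mean_pairwise_jaccard(folds))
```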
The method is compared against state-of-the-art methods (RFE and mRMR) and is shown to achieve the highest AUC values with the smallest number of features selected and the highest stability scores.
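Of the reference methods, RFE is available in scikit-learn; a hedged baseline sketch of combining it with an AUC evaluation is shown below. The estimator, feature count, and single train/test split are illustrative choices, not the evaluation protocol used in the thesis.

```python
# Hedged baseline sketch: scikit-learn's RFE plus an AUC evaluation.
# Estimator, feature count, and split are illustrative assumptions.
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split


def rfe_baseline_auc(X, y, n_features=10):
    """Select features with RFE on the training split, score AUC on the test split."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
    selector = RFE(LogisticRegression(max_iter=1000),
                   n_features_to_select=n_features).fit(X_tr, y_tr)
    model = LogisticRegression(max_iter=1000).fit(selector.transform(X_tr), y_tr)
    scores = model.predict_proba(selector.transform(X_te))[:, 1]
    return roc_auc_score(y_te, scores)
```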