Are screening methods useful in feature selection? An empirical study

Authors: Mingyuan Wang ^aff001; Adrian Barbu ^aff001
Author affiliations: Statistics Department, Florida State University, Tallahassee, Florida, United States of America ^aff001
Published in: PLoS ONE 14(9)
Category: Research Article
DOI: https://doi.org/10.1371/journal.pone.0220842

Abstract

Filter or screening methods are often used as a preprocessing step to reduce the number of variables that a learning algorithm uses when fitting a classification or regression model. Many such filter methods exist, but they lack an objective evaluation. Such an evaluation is needed both to compare the methods with one another and to answer whether they are useful at all, or whether a learning algorithm could do better without them. For this purpose, this paper pairs many popular screening methods with three regression learners and five classification learners and evaluates them on ten real datasets, using accuracy criteria such as R-square and area under the ROC curve (AUC). The results are compared through curve plots and comparison tables to determine whether screening methods improve the performance of learning algorithms and how they fare against each other. Our findings show that screening methods improved the prediction of the best learner on two regression and two classification datasets out of the ten evaluated.
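The kind of comparison described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the screening score (absolute Pearson correlation with the label), the toy nearest-centroid learner, the synthetic data, and all function names are hypothetical stand-ins for the screening methods, learners and real datasets that the study actually uses.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def screen_by_correlation(X, y, k):
    """Filter step (assumed for illustration): keep the k features with largest |corr| with y."""
    p = len(X[0])
    scores = [abs(pearson([row[j] for row in X], y)) for j in range(p)]
    return sorted(range(p), key=lambda j: -scores[j])[:k]


def auc(scores, labels):
    """Area under the ROC curve by pairwise comparison; ties count as 1/2."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)
    return wins / (len(pos) * len(neg))


def centroid_scores(X_train, y_train, X_test, features):
    """Toy learner (not one from the paper): higher score means closer to the class-1 centroid."""
    def centroid(label):
        rows = [r for r, l in zip(X_train, y_train) if l == label]
        return [sum(r[j] for r in rows) / len(rows) for j in features]

    c0, c1 = centroid(0), centroid(1)

    def dist2(row, c):
        return sum((row[j] - cj) ** 2 for j, cj in zip(features, c))

    return [dist2(r, c0) - dist2(r, c1) for r in X_test]


def make_data(n, p, informative, rng):
    """Synthetic data: only the first `informative` features shift with the label."""
    X, y = [], []
    for _ in range(n):
        label = rng.randint(0, 1)
        X.append([rng.gauss(label if j < informative else 0.0, 1.0) for j in range(p)])
        y.append(label)
    return X, y


rng = random.Random(0)
X_train, y_train = make_data(200, 100, 5, rng)
X_test, y_test = make_data(200, 100, 5, rng)

# Screening is fitted on the training split only, so the test AUC stays untouched by it.
selected = screen_by_correlation(X_train, y_train, k=10)
all_features = list(range(100))

auc_all = auc(centroid_scores(X_train, y_train, X_test, all_features), y_test)
auc_screened = auc(centroid_scores(X_train, y_train, X_test, selected), y_test)
print(f"AUC without screening: {auc_all:.3f}")
print(f"AUC with screening (k=10): {auc_screened:.3f}")
```

Whether the screened model scores higher depends on the data and the learner, which is the question the study examines empirically; the sketch only shows the shape of one such comparison.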

Keywords:

Biology and life sciences – Physical sciences – Research and analysis methods – Neuroscience – Cognitive science – Cognitive psychology – Learning – Human learning – Learning and memory – Psychology – Social sciences – Computer and information sciences – Mathematics – Simulation and modeling – Statistics – Mathematical and statistical techniques – Statistical methods – Applied mathematics – Algorithms – Cognition – Memory – Face recognition – Perception – Machine learning algorithms – Boosting algorithms – Artificial intelligence – Machine learning – Learning curves
