Machine learning algorithm validation with a limited sample size
Authors:
Andrius Vabalas (aff001); Emma Gowen (aff002); Ellen Poliakoff (aff002); Alexander J. Casson (aff001)
Author affiliations:
aff001: Materials, Devices and Systems Division, School of Electrical and Electronic Engineering, The University of Manchester, Manchester, England, United Kingdom
aff002: School of Biological Sciences, The University of Manchester, Manchester, England, United Kingdom
Published in:
PLoS ONE 14(11)
Category:
Research Article
DOI:
https://doi.org/10.1371/journal.pone.0224365
Abstract
Advances in neuroimaging, genomics, motion tracking, eye-tracking and many other technology-based data collection methods have led to a torrent of high dimensional datasets, which commonly have a small number of samples because of the intrinsic high cost of data collection involving human participants. High dimensional data with a small number of samples are of critical importance for identifying biomarkers and conducting feasibility and pilot work; however, they can lead to biased machine learning (ML) performance estimates. Our review of studies which have applied ML to distinguish autistic from non-autistic individuals showed that small sample size is associated with higher reported classification accuracy. Thus, we investigated whether this bias could be caused by the use of validation methods which do not sufficiently control overfitting. Our simulations show that K-fold Cross-Validation (CV) produces strongly biased performance estimates with small sample sizes, and the bias is still evident with a sample size of 1000. Nested CV and train/test split approaches produce robust and unbiased performance estimates regardless of sample size. We also show that feature selection, if performed on pooled training and testing data, contributes to bias considerably more than parameter tuning. In addition, the contributions to bias of data dimensionality, hyper-parameter space and number of CV folds were explored, and the validation methods were compared on discriminable data. The results suggest how to design robust testing methodologies when working with small datasets and how to interpret the results of other studies based on which validation method was used.
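The abstract's central point, that feature selection or hyper-parameter tuning performed on pooled training and testing data inflates K-fold CV accuracy while nested CV keeps both steps inside the training folds, can be illustrated with a short simulation. The sketch below is not the authors' code: it is a minimal Python/scikit-learn example on pure Gaussian noise with random labels (true accuracy is roughly 50%), using an assumed setup of 40 samples, 1000 features, univariate feature selection and a linear SVM.

import numpy as np
from sklearn.svm import SVC
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score, GridSearchCV, StratifiedKFold

rng = np.random.default_rng(0)
n_samples, n_features = 40, 1000                           # small sample, high dimensionality
X = rng.standard_normal((n_samples, n_features))           # pure Gaussian noise
y = rng.permutation(np.repeat([0, 1], n_samples // 2))     # balanced random labels: no real signal

outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Biased protocol: the 10 "best" features are chosen using all samples
# (training and testing pooled), then K-fold CV is run on the reduced data.
X_reduced = SelectKBest(f_classif, k=10).fit_transform(X, y)
biased = cross_val_score(SVC(kernel="linear"), X_reduced, y, cv=outer_cv)

# Nested CV: feature selection and C tuning happen only on the training folds
# of the inner loop; the outer loop scores data never seen by either step.
pipe = Pipeline([("select", SelectKBest(f_classif, k=10)),
                 ("svm", SVC(kernel="linear"))])
inner = GridSearchCV(pipe, {"svm__C": [0.01, 0.1, 1, 10]}, cv=3)
nested = cross_val_score(inner, X, y, cv=outer_cv)

print(f"Biased K-fold CV accuracy: {biased.mean():.2f}")   # usually well above 0.50
print(f"Nested CV accuracy:        {nested.mean():.2f}")   # usually close to 0.50

On repeated runs, the biased protocol typically reports accuracies well above chance on this noise data, while the nested estimate stays near 0.50, mirroring the bias pattern described in the abstract.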
Keywords:
Algorithms – Normal distribution – Neuroimaging – Machine learning – Learning curves – Autism – Kernel functions – Gaussian noise