The Univariate Flagging Algorithm (UFA): An interpretable approach for predictive modeling
Authors:
Mallory Sheth (1); Albert Gerovitch (1); Roy Welsch (1); Natasha Markuzon (2)
Author affiliations:
(1) Massachusetts Institute of Technology, Cambridge, Massachusetts, United States of America
(2) The Charles Stark Draper Laboratory, Cambridge, Massachusetts, United States of America
Published in:
PLoS ONE 14(10)
Category:
Research Article
DOI:
https://doi.org/10.1371/journal.pone.0223161
Abstract
In many data classification problems, several methods give similar accuracy. However, when working with people who are not experts in data science, such as doctors, lawyers, and judges, finding interpretable algorithms can be a critical success factor. Practitioners have a deep understanding of the individual input variables but far less insight into how they interact with each other. For example, there may be ranges of an input variable for which the observed outcome is significantly more or less likely. This paper describes an algorithm for automatic detection of the thresholds that bound such ranges, called the Univariate Flagging Algorithm (UFA). The algorithm searches for a separation that optimizes the difference between the separated areas while maintaining a high level of support. We evaluate its performance using six sample datasets and demonstrate that thresholds identified by the algorithm align well with published results and known physiological boundaries. We also introduce two classification approaches that use UFA and show that the performance attained on unseen test data is comparable to or better than that of traditional classifiers when confidence intervals are considered. We identify conditions under which UFA performs well, including applications with large amounts of missing or noisy data, applications with a large number of inputs relative to observations, and applications where incidence of the target is low. We argue that ease of explanation of the results, robustness to missing data and noise, and detection of low-incidence adverse outcomes are desirable features for clinical applications, and that they can be achieved with a relatively simple classifier such as UFA.
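The idea of a univariate threshold search can be illustrated with a minimal sketch. This is not the published UFA procedure: the abstract does not specify the optimization criterion or how support is measured, so the code below assumes a simple stand-in (maximize the gap in outcome rate between the flagged upper range and the remaining observations, subject to a minimum number of observations on each side). The function name `find_high_flag_threshold`, the `min_support` parameter, the choice to skip missing values, and the toy temperature data are all hypothetical.

```python
from typing import List, Optional, Tuple


def find_high_flag_threshold(
    values: List[Optional[float]],
    outcomes: List[int],
    min_support: int = 3,
) -> Optional[Tuple[float, float, int]]:
    """Return (threshold, rate_gap, support) for the best upper-range flag.

    Assumed criterion (illustrative only): maximize the difference between the
    outcome rate at or above the threshold and the outcome rate below it,
    requiring at least `min_support` observations on each side.
    """
    # Assumption: observations with a missing value are simply skipped.
    pairs = [(v, y) for v, y in zip(values, outcomes) if v is not None]
    total = len(pairs)
    total_pos = sum(y for _, y in pairs)

    best: Optional[Tuple[float, float, int]] = None
    # Every distinct observed value is a candidate threshold.
    for t in sorted({v for v, _ in pairs}):
        flagged = [y for v, y in pairs if v >= t]
        rest_n = total - len(flagged)
        if len(flagged) < min_support or rest_n < min_support:
            continue  # not enough support on one side of the split
        flagged_rate = sum(flagged) / len(flagged)
        rest_rate = (total_pos - sum(flagged)) / rest_n
        gap = flagged_rate - rest_rate
        if best is None or gap > best[1]:
            best = (t, gap, len(flagged))
    return best


# Hypothetical toy data: body temperature (deg C) and a binary adverse outcome.
values = [36.5, 36.8, 37.0, None, 37.2, 38.5, 39.0, 39.4, 40.1]
outcomes = [0, 0, 0, 1, 1, 1, 1, 0, 1]

result = find_high_flag_threshold(values, outcomes)
print(result)
```

A symmetric search over `v <= t` would flag low ranges in the same way. Because each variable is examined on its own, a missing value in one input does not prevent flags on the other inputs from being evaluated, which is one plausible reading of the robustness to missing data described above.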
Keywords:
Death rates – Medical doctors – Algorithms – Machine learning algorithms – Machine learning – Support vector machines – Sepsis – Body temperature
This article was published in the journal
PLOS One
2019 Issue 10