Why Cohen’s Kappa should be avoided as performance measure in classification
Authors:
Rosario Delgado aff001; Xavier-Andoni Tibau aff002
Author affiliations:
Department of Mathematics, Universitat Autònoma de Barcelona, Campus de la UAB, Cerdanyola del Vallès, Spain
aff001; Advanced Stochastic Modelling research group, Universitat Autònoma de Barcelona, Campus de la UAB, Cerdanyola del Vallès, Spain
aff002
Published in:
PLoS ONE 14(9)
Category:
Research Article
DOI:
https://doi.org/10.1371/journal.pone.0222916
Abstract
We show that Cohen’s Kappa and the Matthews Correlation Coefficient (MCC), both widely used measures of performance in multi-class classification, are correlated in most situations, although they can differ in others. Indeed, while the two measures coincide in the symmetric case, we consider different unbalanced situations in which Kappa exhibits an undesired behaviour, i.e. a worse classifier gets a higher Kappa score, differing qualitatively from the behaviour of MCC. The debate about the incoherent behaviour of Kappa has revolved around the convenience, or not, of using a relative metric, which makes the interpretation of its values difficult. We extend these concerns by showing that its pitfalls can go even further. Through experimentation, we present a novel approach to this topic. We carry out a comprehensive study that identifies a scenario in which the contradictory behaviour between MCC and Kappa emerges. Specifically, we find that when the entropy of the off-diagonal elements of the confusion matrix associated with a classifier decreases to zero, the discrepancy between Kappa and MCC rises, pointing to an anomalous behaviour of the former. We believe that this finding disqualifies Kappa from being used, in general, as a performance measure to compare classifiers.
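Both measures discussed in the abstract can be computed directly from a confusion matrix. The following sketch (the function name and example matrices are our own illustration, not taken from the paper) uses Cohen’s original definition of Kappa and Gorodkin’s multiclass generalisation of MCC:

```python
import numpy as np

def kappa_and_mcc(cm):
    """Compute Cohen's Kappa and multiclass MCC (Gorodkin, 2004)
    from a confusion matrix with true classes in rows and
    predicted classes in columns."""
    cm = np.asarray(cm, dtype=float)
    n = cm.sum()                 # total number of instances
    correct = np.trace(cm)       # correctly classified instances
    rows = cm.sum(axis=1)        # per-class totals of the true labels
    cols = cm.sum(axis=0)        # per-class totals of the predictions

    # Cohen's Kappa: observed agreement corrected for chance agreement
    p_o = correct / n
    p_e = (rows @ cols) / n**2
    kappa = (p_o - p_e) / (1 - p_e)

    # Multiclass MCC
    num = correct * n - rows @ cols
    den = np.sqrt((n**2 - cols @ cols) * (n**2 - rows @ rows))
    mcc = num / den
    return kappa, mcc

# Symmetric matrix: the two measures coincide (both equal 0.8 here)
print(kappa_and_mcc([[45, 5], [5, 45]]))

# Unbalanced matrix with concentrated off-diagonal mass: they differ
print(kappa_and_mcc([[70, 20, 0], [5, 2, 0], [2, 0, 1]]))
```

In the symmetric two-class example both measures equal 0.8, matching the abstract’s claim that they coincide in the symmetric case, while the unbalanced example yields slightly different values for the two measures.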
Keywords:
Psychology – Medicine and health sciences – Probability distribution – Machine learning – Entropy – Statistical distributions – Protein structure prediction – Covariance