EUSKOR: End-to-end coreference resolution system for Basque

Authors: Ander Soraluze ^aff001; Olatz Arregi ^aff002; Xabier Arregi ^aff001; Arantza Díaz de Ilarraza ^aff001
Author affiliations: Computer Languages and Systems Department, University of the Basque Country, Donostia-San Sebastian, Spain ^aff001; Computer Architecture and Technology Department, University of the Basque Country, Donostia-San Sebastian, Spain ^aff002
Published in: PLoS ONE 14(9)
Category: Research Article
DOI: https://doi.org/10.1371/journal.pone.0221801

Abstract

This paper describes the adaptation of the Stanford coreference resolution module to Basque, taking the characteristics of the language into account. The module has been integrated into a linguistic analysis pipeline, yielding an end-to-end coreference resolution system for Basque. The adaptation process described here can help other languages with similar characteristics implement their own coreference resolution systems. In our experiments, language-specific features had a noteworthy effect on coreference resolution, giving a gain of 7.07 points in CoNLL score over the baseline system. We also analysed the effect of preprocessing on coreference resolution by comparing the results obtained with automatic mentions against those obtained with gold mentions. When gold mentions are provided, the CoNLL score is 11.5 points higher than with automatic mentions. We analyse the contribution of each sieve and conclude that morphology is essential for agglutinative languages to achieve good performance in coreference resolution. Finally, we present an error analysis of the system, which revealed its weak points and helped to determine how to improve it. As a result of the error analysis, we enriched the Basque coreference resolution system with two new sieves, obtaining an improvement of 0.24 points in CoNLL F₁ when automatic mentions are used and of 0.39 points when gold mentions are provided.
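
The gains above are reported in CoNLL score, which is conventionally the unweighted mean of the MUC, B³ and CEAFe F₁ values (the CoNLL-2012 shared task definition). Assuming the paper follows that convention, the following minimal Python sketch shows how such a score and a gain over a baseline would be computed. The function name `conll_score` and all numbers are hypothetical placeholders chosen for illustration; they are not results from the paper.

```python
from statistics import mean


def conll_score(muc_f1: float, b_cubed_f1: float, ceaf_e_f1: float) -> float:
    """Unweighted mean of the MUC, B-cubed and CEAFe F1 values.

    Assumption: this is the CoNLL-2012 shared task convention; the
    abstract itself does not spell out the formula.
    """
    return mean([muc_f1, b_cubed_f1, ceaf_e_f1])


# Placeholder F1 values for illustration only; NOT taken from the paper.
baseline = conll_score(50.0, 52.0, 48.0)
improved = conll_score(56.0, 59.0, 56.0)

# A gain is simply the difference between two CoNLL scores.
gain = improved - baseline
print(f"baseline={baseline:.2f} improved={improved:.2f} gain={gain:.2f}")
```

Because the score is an average of three metrics, a gain of several points usually reflects improvement across link-based (MUC), mention-based (B³) and entity-based (CEAFe) evaluation rather than in one metric alone.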

Keywords:

Biology and life sciences – Engineering and technology – Neuroscience – Cognitive science – Cognitive psychology – Psychology – Social sciences – Sociology – Communications – People and places – Computer and information sciences – Geographical locations – Europe – Mass media – Encyclopedias – Language – Linguistics – Linguistic morphology – Grammar – Syntax – Semantics – Online encyclopedias – Software engineering – Preprocessing
