
Proceedings of ISP RAS, 2023 Volume 35, Issue 5, Pages 215–228 (Mi tisp824)

Here we go again: modern GEC models need help with spelling

V. M. Starchenko, A. M. Starchenko

National Research University Higher School of Economics

Abstract: The study focuses on how modern GEC systems handle character-level errors. We discuss the ways these errors affect model performance and test how models of different architectures handle them. We conclude that specialized GEC systems do struggle to correct non-existent words, and that a simple spellchecker considerably improves a model's overall performance. To evaluate this, we assess the models on several datasets. In addition to the CoNLL-2014 validation dataset, we contribute a synthetic dataset with a higher density of character-level errors and conclude that, given that models generally show very high scores, validation datasets with a higher density of tricky errors are a useful tool for comparing models. Lastly, we notice cases of incorrect treatment of non-existent words in the experts' annotation and contribute a cleaned version of this dataset. In contrast to specialized GEC systems, the LLaMA model used for the GEC task handles character-level errors well. We suggest that this better performance is explained by the fact that Alpaca is not extensively trained on annotated texts with errors, but instead receives grammatically and orthographically correct texts as input.
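The spellcheck-preprocessing idea from the abstract can be sketched as follows. This is a minimal illustration, not the paper's actual pipeline: the tiny vocabulary and the `spellcheck()` helper are hypothetical, and a real system would use a full dictionary-based spellchecker before passing the text to a GEC model.

```python
# Minimal sketch (illustrative only): replace non-existent words with the
# closest dictionary entry before handing the text to a GEC model.
import difflib
import re

# Toy vocabulary; a real spellchecker would use a full wordlist.
VOCAB = {"the", "quick", "brown", "fox", "jumps", "over", "lazy", "dog"}

def spellcheck(text, vocab=VOCAB):
    """Replace out-of-vocabulary tokens with their closest vocabulary match."""
    # Split into word and non-word runs so punctuation/whitespace survive.
    tokens = re.findall(r"\w+|\W+", text)
    fixed = []
    for tok in tokens:
        low = tok.lower()
        if low.isalpha() and low not in vocab:
            # Fuzzy lookup of the nearest known word (difflib similarity).
            match = difflib.get_close_matches(low, vocab, n=1, cutoff=0.6)
            if match:
                tok = match[0]
        fixed.append(tok)
    return "".join(fixed)

print(spellcheck("the quikc brown fxo jumps"))
# → the quick brown fox jumps
```

In this scheme only character-level errors producing non-existent words are touched; grammatical errors (and real-word spelling errors) are left for the downstream GEC model, which matches the division of labour the abstract describes.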

Keywords: GEC, validation, spellcheck, preprocessing, generated datasets

Language: English

DOI: 10.15514/ISPRAS-2023-35(5)-14



© Steklov Math. Inst. of RAS, 2024