Language ID in the Wild: Unexpected Challenges on the Path to a Thousand-Language Web Text Corpus

Resource type
Conference Paper
Authors/contributors
Caswell, I.; Breiner, T.; van Esch, D.; Bapna, A.
Title
Language ID in the Wild: Unexpected Challenges on the Path to a Thousand-Language Web Text Corpus
Abstract
Large text corpora are increasingly important for a wide variety of Natural Language Processing (NLP) tasks, and automatic language identification (LangID) is a core technology needed to collect such datasets in a multilingual context. LangID is largely treated as solved in the literature, with models reported that achieve over 90% average F1 on as many as 1,366 languages. We train LangID models on up to 1,629 languages with comparable quality on held-out test sets, but find that human-judged LangID accuracy for web-crawl text corpora created using these models is only around 5% for many lower-resource languages, suggesting a need for more robust evaluation. Further analysis revealed a variety of error modes, arising from domain mismatch, class imbalance, language similarity, and insufficiently expressive models. We propose two classes of techniques to mitigate these errors: wordlist-based tunable-precision filters (for which we release curated lists in about 500 languages) and transformer-based semi-supervised LangID models, which increase median dataset precision from 5.5% to 71.2%. These techniques enable us to create an initial data set covering 100K or more relatively clean sentences in each of 500+ languages, paving the way towards a 1,000-language web text corpus.
Date
2020-12
Proceedings Title
Proceedings of the 28th International Conference on Computational Linguistics
Place
Barcelona, Spain (Online)
Publisher
International Committee on Computational Linguistics
Pages
6588–6608
Reference
Caswell, I., Breiner, T., van Esch, D., & Bapna, A. (2020). Language ID in the Wild: Unexpected Challenges on the Path to a Thousand-Language Web Text Corpus. In D. Scott, N. Bel, & C. Zong (Eds.), Proceedings of the 28th International Conference on Computational Linguistics (pp. 6588–6608). International Committee on Computational Linguistics. https://doi.org/10.18653/v1/2020.coling-main.579