Results | COLaF Bibliography

Zampieri, M., North, K., Jauhiainen, T., Felice, M., Kumari, N., Nair, N., & Bangera, Y. M. (2024). Language Variety Identification with True Labels. In N. Calzolari, M.-Y. Kan, V. Hoste, A. Lenci, S. Sakti, & N. Xue (Eds.), Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024) (pp. 10100–10109). ELRA and ICCL. https://aclanthology.org/2024.lrec-main.882

Language identification is an important first step in many NLP applications. Most publicly available language identification datasets, however, are compiled under the assumption that the gold label of each instance is determined by where texts are retrieved from. Research has shown that this is a problematic assumption, particularly in the case of very similar languages (e.g., Croatian and Serbian) and national language varieties (e.g., Brazilian and European Portuguese), where texts may contain no distinctive marker of the particular language or variety. To overcome this important limitation, this paper presents DSL True Labels (DSL-TL), the first human-annotated multilingual dataset for language variety identification. DSL-TL contains a total of 12,900 instances in Portuguese, split between European Portuguese and Brazilian Portuguese; Spanish, split between Argentine Spanish and Castilian Spanish; and English, split between American English and British English. We trained multiple models to discriminate between these language varieties, and we present the results in detail. The data and models presented in this paper provide a reliable benchmark toward the development of robust and fairer language variety identification systems. We make DSL-TL freely available to the research community.

Kevers, L. (2021). L’identification de langue, un outil au service du corse et de l’évaluation des ressources linguistiques [Language identification, a tool for Corsican and for the evaluation of linguistic resources]. Traitement Automatique des Langues, 62(3), 13–37. https://aclanthology.org/2021.tal-3.2

Caswell, I., Breiner, T., van Esch, D., & Bapna, A. (2020). Language ID in the Wild: Unexpected Challenges on the Path to a Thousand-Language Web Text Corpus. In D. Scott, N. Bel, & C. Zong (Eds.), Proceedings of the 28th International Conference on Computational Linguistics (pp. 6588–6608). International Committee on Computational Linguistics. https://doi.org/10.18653/v1/2020.coling-main.579

Large text corpora are increasingly important for a wide variety of Natural Language Processing (NLP) tasks, and automatic language identification (LangID) is a core technology needed to collect such datasets in a multilingual context. LangID is largely treated as solved in the literature, with models reported that achieve over 90% average F1 on as many as 1,366 languages. We train LangID models on up to 1,629 languages with comparable quality on held-out test sets, but find that human-judged LangID accuracy for web-crawl text corpora created using these models is only around 5% for many lower-resource languages, suggesting a need for more robust evaluation. Further analysis revealed a variety of error modes, arising from domain mismatch, class imbalance, language similarity, and insufficiently expressive models. We propose two classes of techniques to mitigate these errors: wordlist-based tunable-precision filters (for which we release curated lists in about 500 languages) and transformer-based semi-supervised LangID models, which increase median dataset precision from 5.5% to 71.2%. These techniques enable us to create an initial data set covering 100K or more relatively clean sentences in each of 500+ languages, paving the way towards a 1,000-language web text corpus.
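The idea of a wordlist-based tunable-precision filter mentioned in this abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name, the whitespace tokenization, the punctuation stripping, the threshold values, and the toy French wordlist are all assumptions made for illustration. A sentence is kept only if a sufficient fraction of its tokens appears in the wordlist for the predicted language, and raising the threshold trades recall for precision.

```python
def passes_wordlist_filter(sentence, wordlist, threshold=0.2):
    """Return True if at least `threshold` of the tokens are in `wordlist`.

    Hypothetical helper: tokenization is plain whitespace splitting with
    surrounding punctuation stripped, which is only adequate for a demo.
    """
    tokens = [tok.strip(".,;:!?\"'()") for tok in sentence.lower().split()]
    tokens = [tok for tok in tokens if tok]
    if not tokens:
        return False
    hits = sum(1 for tok in tokens if tok in wordlist)
    return hits / len(tokens) >= threshold


# Toy wordlist (assumption): a handful of frequent French function words.
FRENCH_WORDS = {"le", "la", "et", "est", "de"}

kept = passes_wordlist_filter("Le chat est sur la table.", FRENCH_WORDS)
rejected = passes_wordlist_filter("The cat is on the table.", FRENCH_WORDS)
strict = passes_wordlist_filter("Le chat est sur la table.", FRENCH_WORDS, threshold=0.6)
```

With the default threshold the French sentence is kept (3 of 6 tokens match) and the English one is rejected; with a stricter threshold of 0.6 the French sentence is rejected as well, which is the sense in which the precision is tunable.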

Lui, M., & Baldwin, T. (2011). Cross-domain Feature Selection for Language Identification. In H. Wang & D. Yarowsky (Eds.), Proceedings of 5th International Joint Conference on Natural Language Processing (pp. 553–561). Asian Federation of Natural Language Processing. https://aclanthology.org/I11-1062

Bernier-Colborne, G., Leger, S., & Goutte, C. (n.d.). Transfer Learning Improves French Cross-Domain Dialect Identification: NRC @ VarDial 2022.

We describe the systems developed by the National Research Council Canada for the French Cross-Domain Dialect Identification shared task at the 2022 VarDial evaluation campaign. We evaluated two different approaches to this task: SVM and probabilistic classifiers exploiting n-grams as features, and trained from scratch on the data provided; and a pre-trained French language model, CamemBERT, that we fine-tuned on the dialect identification task. The latter method turned out to improve the macro-F1 score on the test set from 0.344 to 0.430 (25% increase), which indicates that transfer learning can be helpful for dialect identification.
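For readers unfamiliar with the metric, the macro-F1 score and the relative increase quoted in this abstract can be reproduced with a short sketch. The function names and the toy gold and predicted labels below are assumptions for illustration only; they are not the shared-task data or the authors' evaluation script.

```python
from collections import Counter


def macro_f1(gold, pred):
    """Unweighted mean of per-label F1 scores over all labels seen."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted label p wrongly
            fn[g] += 1  # missed gold label g
    labels = set(gold) | set(pred)
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


def relative_increase(old, new):
    """Relative change from `old` to `new`, as a fraction of `old`."""
    return (new - old) / old


# Toy labels (assumption): two hypothetical dialect classes.
gold = ["FR", "FR", "CA", "CA"]
pred = ["FR", "CA", "CA", "CA"]
toy_score = macro_f1(gold, pred)

# Figures quoted in the abstract: 0.344 -> 0.430.
gain = relative_increase(0.344, 0.430)
```

Here `gain` comes out at 0.25, matching the 25% increase reported in the abstract: it is a relative increase over the baseline, not a gain of 25 percentage points.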
