OcWikiDisc: a Corpus of Wikipedia Talk Pages in Occitan
Resource type
Conference Paper
Authors/contributors
- Miletic, Aleksandra (Author)
- Scherrer, Yves (Author)
- Scherrer, Yves (Editor)
- Jauhiainen, Tommi (Editor)
- Ljubešić, Nikola (Editor)
- Nakov, Preslav (Editor)
- Tiedemann, Jörg (Editor)
- Zampieri, Marcos (Editor)
Title
OcWikiDisc: a Corpus of Wikipedia Talk Pages in Occitan
Abstract
This paper presents OcWikiDisc, a new freely available corpus in Occitan, as well as language identification experiments on Occitan conducted as part of the corpus building process. Occitan is a regional language spoken mainly in the south of France and in parts of Spain and Italy. It exhibits rich diatopic variation, is not standardized, and remains low-resourced, especially when it comes to large downloadable corpora. We introduce OcWikiDisc, a corpus extracted from the talk pages associated with the Occitan Wikipedia. The version of the corpus with the most restrictive language filtering contains 8K user messages for a total of 618K tokens. The language filtering is performed based on language identification experiments with five off-the-shelf tools, including the new fastText language identification model from Meta AI's No Language Left Behind initiative, released in July 2022.
Date
2022-10
Proceedings Title
Proceedings of the Ninth Workshop on NLP for Similar Languages, Varieties and Dialects
Place
Gyeongju, Republic of Korea
Publisher
Association for Computational Linguistics
Pages
70–79
Reference
Miletic, A., & Scherrer, Y. (2022). OcWikiDisc: a Corpus of Wikipedia Talk Pages in Occitan. In Y. Scherrer, T. Jauhiainen, N. Ljubešić, P. Nakov, J. Tiedemann, & M. Zampieri (Eds.), Proceedings of the Ninth Workshop on NLP for Similar Languages, Varieties and Dialects (pp. 70–79). Association for Computational Linguistics. https://aclanthology.org/2022.vardial-1.8