OcWikiDisc: a Corpus of Wikipedia Talk Pages in Occitan

Miletic, Aleksandra; Scherrer, Yves

Resource type

Conference Paper

Authors/contributors

Miletic, Aleksandra (Author)
Scherrer, Yves (Author)
Scherrer, Yves (Editor)
Jauhiainen, Tommi (Editor)
Ljubešić, Nikola (Editor)
Nakov, Preslav (Editor)
Tiedemann, Jörg (Editor)
Zampieri, Marcos (Editor)

Title

OcWikiDisc: a Corpus of Wikipedia Talk Pages in Occitan

Abstract

This paper presents OcWikiDisc, a new freely available corpus in Occitan, as well as language identification experiments on Occitan done as part of the corpus building process. Occitan is a regional language spoken mainly in the south of France and in parts of Spain and Italy. It exhibits rich diatopic variation, it is not standardized, and it is still low-resourced, especially when it comes to large downloadable corpora. We introduce OcWikiDisc, a corpus extracted from the talk pages associated with the Occitan Wikipedia. The version of the corpus with the most restrictive language filtering contains 8K user messages for a total of 618K tokens. The language filtering is performed based on language identification experiments with five off-the-shelf tools, including the new fastText language identification model from Meta AI's No Language Left Behind initiative, released in July 2022.

Date

2022-10

Proceedings Title

Proceedings of the Ninth Workshop on NLP for Similar Languages, Varieties and Dialects

Place

Gyeongju, Republic of Korea

Publisher

Association for Computational Linguistics

Pages

70–79

URL

https://aclanthology.org/2022.vardial-1.8

Reference

Miletic, A., & Scherrer, Y. (2022). OcWikiDisc: a Corpus of Wikipedia Talk Pages in Occitan. In Y. Scherrer, T. Jauhiainen, N. Ljubešić, P. Nakov, J. Tiedemann, & M. Zampieri (Eds.), Proceedings of the Ninth Workshop on NLP for Similar Languages, Varieties and Dialects (pp. 70–79). Association for Computational Linguistics. https://aclanthology.org/2022.vardial-1.8

Corpus

Text
- Web

Language

Occitan

Link to this record

https://colaf.huma-num.fr/bibliography/53Y4ZFT2