Your search
Results: 8 resources
-
We present lemmatization experiments on the unstandardized low-resourced languages Low Saxon and Occitan using two machine-learning-based approaches represented by MaChAmp and Stanza. We show different ways to increase training data by leveraging historical corpora, small amounts of gold data and dictionary information, and discuss the usefulness of this additional data. In the results, we find some differences in the performance of the models depending on the language. This variation is likely to be partly due to differences in the corpora we used, such as the amount of internal variation. However, we also observe common tendencies, for instance that sequential models trained only on gold-annotated data often yield the best overall performance and generalize better to unknown tokens.
-
A rule-based Python programme for tokenising texts in Occitan. To launch the programme, execute the following instruction: python3 tokenizer_occitan.py < input.txt > output.conllu The script takes as input a text file with a single sentence per line; each line starts with a sentence ID, followed by a tab character, followed by the sentence itself. The current version of the tool was developed during the projects DIVITAL (funded by the ANR) and CorCoDial (funded by the Academy of Finland).
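The input format described above (sentence ID, tab, sentence, one per line) can be prepared with a few lines of Python. This is only a minimal sketch: the helper name `format_tokenizer_input`, the `sent1`, `sent2`, ... ID scheme and the sample sentences are illustrative assumptions and are not part of the tool itself.

```python
from pathlib import Path


def format_tokenizer_input(sentences, prefix="sent"):
    """Return one line per sentence: ID, tab character, sentence text.

    The ID scheme (prefix + running number) is an assumption made for
    this sketch; the tool only requires that each line starts with an ID.
    """
    lines = []
    for i, sentence in enumerate(sentences, start=1):
        # Collapse internal whitespace (including stray tabs and newlines)
        # so that every sentence stays on a single line.
        text = " ".join(sentence.split())
        lines.append(f"{prefix}{i}\t{text}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Placeholder sentences, used only to illustrate the file layout.
    sample = ["Primièra frasa d'exemple.", "Segonda frasa d'exemple."]
    Path("input.txt").write_text(format_tokenizer_input(sample), encoding="utf-8")
```

The resulting input.txt can then be passed to the script with the command given in the description.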
-
This work presents recent contributions to the effort to equip Occitan with resources and tools for NLP. Several existing resources were modified or adapted, notably a rule-based tokenizer, a morphosyntactic lexicon and a treebank. These resources were used to train and evaluate neural models for lemmatization. As part of these experiments, a new, larger corpus (2 million tokens) drawn from Wikipedia was annotated with parts of speech, lemmatized and released.
-
This paper presents OcWikiDisc, a new freely available corpus in Occitan, as well as language identification experiments on Occitan done as part of the corpus building process. Occitan is a regional language spoken mainly in the south of France and in parts of Spain and Italy. It exhibits rich diatopic variation, it is not standardized, and it is still low-resourced, especially when it comes to large downloadable corpora. We introduce OcWikiDisc, a corpus extracted from the talk pages associated with the Occitan Wikipedia. The version of the corpus with the most restrictive language filtering contains 8K user messages for a total of 618K tokens. The language filtering is based on language identification experiments with five off-the-shelf tools, including the new fastText language identification model from Meta AI's No Language Left Behind initiative, released in July 2022.
-
This article reports on lessons learned from converting annotated corpora for Alsatian and Occitan to the CoNLL-U format defined in the Universal Dependencies project. It focuses in particular on several points that require care, concerning tokenization and the definition of categories for annotation.
-
This paper presents Loflòc (Lexic obèrt flechit Occitan – Open Inflected Lexicon of Occitan), a morphological lexicon for Occitan. Even though the lexicon no longer occupies the same place in the NLP pipeline since the advent of large language models, it remains a crucial resource for low-resourced languages. Occitan is a Romance language spoken in the south of France and in parts of Italy and Spain. It is not recognized as an official language in France and no standard variety is shared across the area. To the best of our knowledge, Loflòc is the first publicly available lexicon for Occitan. It contains 650 thousand entries for 57 thousand lemmas. Each entry is accompanied by the corresponding Universal Dependencies Part-of-Speech tag. We show that the lexicon has solid coverage on the existing freely available corpora of Occitan in four major dialects. Coverage gaps on multi-dialect corpora are overwhelmingly driven by dialectal variation, which affects both open and closed classes. Based on this analysis we propose directions for future improvements.
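To make the notion of lexicon coverage concrete, the sketch below computes the share of corpus tokens whose form appears in a lexicon. It is only an illustration: the abstract does not specify how coverage was measured, so the case-insensitive, token-level definition and the function name `lexicon_coverage` are assumptions.

```python
def lexicon_coverage(tokens, lexicon_forms):
    """Return (coverage, missing_tokens) for a list of corpus tokens.

    Coverage is the fraction of tokens whose lowercased form is found
    in the lexicon. This token-level, case-insensitive definition is an
    assumption made for this sketch.
    """
    forms = {form.lower() for form in lexicon_forms}
    tokens = list(tokens)
    if not tokens:
        return 0.0, []
    # Tokens absent from the lexicon, kept in corpus order for inspection.
    missing = [tok for tok in tokens if tok.lower() not in forms]
    return 1 - len(missing) / len(tokens), missing
```

Inspecting the returned list of missing tokens is one way to check whether gaps cluster around particular dialectal forms or word classes.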
-
This paper outlines the ongoing effort of creating the first treebank for Occitan, a low-resourced regional language spoken mainly in the south of France. We briefly present the global context of the project and report on its current status. We adopt the Universal Dependencies framework for this project. Our methodology is based on two main principles. Firstly, in order to guarantee the annotation quality, we use the agile annotation approach. Secondly, we rely on pre-processing using existing tools (taggers and parsers) to facilitate the work of human annotators, mainly through a delexicalized cross-lingual parsing approach. We present the results available at this point (annotation guidelines and a sub-corpus annotated with PoS tags and lemmas) and give the timeline for the rest of the work.