Votre recherche
Résultats 3 ressources
-
This paper presents Loflòc (Lexic obèrt flechit Occitan – Open Inflected Lexicon of Occitan), a morphological lexicon for Occitan. Even though the lexicon no longer occupies the same place in the NLP pipeline since the advent of large language models, it remains a crucial resource for low-resourced languages. Occitan is a Romance language spoken in the south of France and in parts of Italy and Spain. It is not recognized as an official language in France and no standard variety is shared across the area. To the best of our knowledge, Loflòc is the first publicly available lexicon for Occitan. It contains 650 thousand entries for 57 thousand lemmas. Each entry is accompanied by the corresponding Universal Dependencies Part-of-Speech tag. We show that the lexicon has solid coverage on the existing freely available corpora of Occitan in four major dialects. Coverage gaps on multi-dialect corpora are overwhelmingly driven by dialectal variation, which affects both open and closed classes. Based on this analysis we propose directions for future improvements.
-
This paper describes different approaches for developing, for the first time, an automatic speech recognition system for two of the main dialects of Occitan, namely Gascon and Languedocian, and the results obtained in them. The difficulty of the task lies in the fact that Occitan is a less-resourced language. Although a great effort has been made to collect or create corpora of each variant (transcribed speech recordings for the acoustic models and two text corpora for the language models), the sizes of the corpora obtained are far from those of successful systems reported in the literature, and thus we have tested different techniques to compensate for the lack of resources. We have developed classical systems using Kaldi, creating an acoustic model for each variant and also creating language models from the collected corpora and from machine translated texts. We have also tried fine-tuning a Whisper model with our speech corpora. We report word error rates of 20.86 for Gascon and 13.52 for Languedocian with the Kaldi systems and 16.37 for Gascon and 11.74 for Languedocian with Whisper.
-
Parallel corpora are still scarce for most of the world's language pairs. The situation is by no means different for regional languages of France. In addition, adequate web interfaces facilitate and encourage the use of parallel corpora by target users, such as language learners and teachers, as well as linguists. In this paper, we describe ParCoLab, a parallel corpus and a web platform for querying the corpus. From its onset, ParCoLab has been geared towards lower-resource languages, with an initial corpus in Serbian, along with French and English (later Spanish). We focus here on the extension of ParCoLab with a parallel corpus for four regional languages of France: Alsatian, Corsican, Occitan and Poitevin-Saintongeais. In particular, we detail criteria for choosing texts and issues related to their collection. The new parallel corpus contains more than 20k tokens per regional language.
Explorer
Langue
- Occitan
- Alsacien (1)
- Corse (1)
-
Multilingue
(1)
- Langues COLaF (1)
- Poitevin-Saintongeais (1)