Votre recherche
Résultats 8 ressources
-
The Occitan language is a less resourced language and is classified as `in danger' by the UNESCO. Thereby, it is important to build resources and tools that can help to safeguard and develop the digitisation of the language. CorpusArièja is a collection of 72 texts (just over 41,000 tokens) in the Occitan language of the French department of Ariège. The majority of the texts needed to be digitised and pass within an Optical Character Recognition. This corpus contains dialectal and spelling variation, but is limited to prose, without diachronic variation or genre variation. It is an annotated corpus with two levels of lemmatisation, POS tags and verbal inflection. One of the main aims of the corpus is to enable the conception of tools that can automatically annotate all Occitan texts, regardless of the dialect or spelling used. The Ariège territory is interesting because it includes the two variations that we focus on, dialectal and spelling. It has plenty of authors that write in their native language, their variety of Occitan.
-
Parallel corpora are still scarce for most of the world's language pairs. The situation is by no means different for regional languages of France. In addition, adequate web interfaces facilitate and encourage the use of parallel corpora by target users, such as language learners and teachers, as well as linguists. In this paper, we describe ParCoLab, a parallel corpus and a web platform for querying the corpus. From its onset, ParCoLab has been geared towards lower-resource languages, with an initial corpus in Serbian, along with French and English (later Spanish). We focus here on the extension of ParCoLab with a parallel corpus for four regional languages of France: Alsatian, Corsican, Occitan and Poitevin-Saintongeais. In particular, we detail criteria for choosing texts and issues related to their collection. The new parallel corpus contains more than 20k tokens per regional language.
-
While existing neural network-based approaches have shown promising results in Handwritten Text Recognition (HTR) for high-resource languages and standardized/machine-written text, their application to low-resource languages often presents challenges, resulting in reduced effectiveness. In this paper, we propose an innovative HTR approach that leverages the Transformer architecture for recognizing handwritten Old Occitan language. Given the limited availability of data, which comprises only word pairs of graphical variants and lemmas, we develop and rely on elaborate data augmentation techniques for both text and image data. Our model combines a custom-trained Swin image encoder with a BERT text decoder, which we pre-train using a large-scale augmented synthetic data set and fine-tune on the small human-labeled data set. Experimental results reveal that our approach surpasses the performance of current state-of-the-art models for Old Occitan HTR, including open-source Transformer-based models such as a fine-tuned TrOCR and commercial applications like Google Cloud Vision. To nurture further research and development, we make our models, data sets, and code publicly available.
-
Ce travail présente des contributions récentes à l'effort de doter l'occitan de ressources et outils pour le TAL. Plusieurs ressources existantes ont été modifiées ou adaptées, notamment un tokéniseur à base de règles, un lexique morphosyntaxique et un corpus arboré. Ces ressources ont été utilisées pour entraîner et évaluer des modèles neuronaux pour la lemmatisation. Dans le cadre de ces expériences, un nouveau corpus plus large (2 millions de tokens) provenant du Wikipédia a été annoté en parties du discours, lemmatisé et diffusé.
-
This paper presents OcWikiDisc, a new freely available corpus in Occitan, as well as language identification experiments on Occitan done as part of the corpus building process. Occitan is a regional language spoken mainly in the south of France and in parts of Spain and Italy. It exhibits rich diatopic variation, it is not standardized, and it is still low-resourced, especially when it comes to large downloadable corpora. We introduce OcWikiDisc, a corpus extracted from the talk pages associated with the Occitan Wikipedia. The version of the corpus with the most restrictive language filtering contains 8K user messages for a total of 618K tokens. The language filtering is performed based on language identification experiments with five off-the-shelf tools, including the new fasttext's language identification model from Meta AI's No Language Left Behind initiative, released in July 2022.
-
This paper outlines the ongoing effort of creating the first treebank for Occitan, a low-ressourced regional language spoken mainly in the south of France. We briefly present the global context of the project and report on its current status. We adopt the Universal Dependencies framework for this project. Our methodology is based on two main principles. Firstly, in order to guarantee the annotation quality, we use the agile annotation approach. Secondly, we rely on pre-processing using existing tools (taggers and parsers) to facilitate the work of human annotators, mainly through a delexicalized cross-lingual parsing approach. We present the results available at this point (annotation guidelines and a sub-corpus annotated with PoS tags and lemmas) and give the timeline for the rest of the work.
Explorer
Corpus
-
Texte
-
Annotated
(4)
- Morphology (3)
- Parallel (1)
- Web (1)
-
Annotated
(4)
Langue
- Occitan
- Alsacien (2)
- Corse (1)
-
Multilingue
(1)
- Langues COLaF (1)
- Picard (1)
- Poitevin-Saintongeais (1)