Your search
Results: 33 resources
-
Word classes are linguistic categories that serve as a basis for describing the vocabulary and grammar of natural languages. While important publications are regularly devoted to their definition, identification, and classification, in the field of Romance linguistics we lack a comprehensive, state-of-the-art overview of the current research. This Manual offers an updated and detailed discussion of all relevant aspects related to word classes in the Romance languages. In the first part, word classes are discussed from both a theoretical and historical point of view. The second part of the volume takes as its point of departure single word classes, described transversally in all the main Romance languages, while the third observes the relevant word classes from the point of view of specific Romance(-based) varieties. The fourth part explores Romance word classes at the interface of grammar and other fields of research. The Manual is intended as a reference work for all scholars and students interested in the description of both the standard, major Romance languages and the smaller, lesser-described Romance(-based) varieties.
-
The Manual offers the first comprehensive overview of Occitan and Occitan linguistics, including the latest research. With 26 contributions organized in seven parts, it covers diachronic and synchronic language description, diatopic variation and sociolinguistics, language planning and equipment, as well as teaching and language pedagogy.
-
A rule-based Python programme for tokenising texts in Occitan. To launch the programme, run the following command:
python3 tokenizer_occitan.py < input.txt > output.conllu
The script takes as input a text file with one sentence per line; each line starts with a sentence ID, followed by a tab character, followed by the sentence itself. The current version of the tool was developed during the projects DIVITAL (funded by the ANR) and CorCoDial (funded by the Academy of Finland).
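The input format described above (sentence ID, tab, sentence) can be illustrated with a minimal sketch. This is not the code of tokenizer_occitan.py: the function names, the regular expression, and the exact comment lines are assumptions made for illustration, and the naive word/punctuation split stands in for the tool's Occitan-specific rules. The output follows the general CoNLL-U layout of ten tab-separated columns, with unknown fields set to an underscore.

```python
import re


def parse_line(line):
    """Split one input line into (sentence ID, sentence text)."""
    sent_id, _, text = line.rstrip("\n").partition("\t")
    return sent_id, text


def naive_tokens(text):
    # Placeholder tokenisation: runs of word characters, or single
    # punctuation marks. The real tool applies Occitan-specific rules.
    return re.findall(r"\w+|[^\w\s]", text)


def to_conllu(sent_id, text):
    """Render one sentence as a CoNLL-U-style block (form column only)."""
    rows = [f"# sent_id = {sent_id}", f"# text = {text}"]
    for index, token in enumerate(naive_tokens(text), start=1):
        # Columns: ID, FORM, then eight fields left unspecified ("_").
        rows.append("\t".join([str(index), token] + ["_"] * 8))
    return "\n".join(rows) + "\n"


def convert(lines):
    """Convert an iterable of input lines into one CoNLL-U-style string."""
    blocks = [to_conllu(*parse_line(line)) for line in lines if line.strip()]
    return "\n".join(blocks)
```

Calling `convert(open("input.txt", encoding="utf-8"))` would produce one block per sentence, separated by blank lines; the file name is the one used in the command above.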
-
Effectively normalizing spellings in textual data poses a considerable challenge, especially for low-resource languages lacking standardized writing systems. In this study, we fine-tuned a multilingual model with data from several Occitan dialects and conducted a series of experiments to assess the model's representations of these dialects. For evaluation purposes, we compiled a parallel lexicon encompassing four Occitan dialects. Intrinsic evaluations of the model's embeddings revealed that surface similarity between the dialects strengthened representations. When the model was further fine-tuned for part-of-speech tagging, its performance was robust to dialectal variation, even when trained solely on part-of-speech data from a single dialect. Our findings suggest that large multilingual models minimize the need for spelling normalization during pre-processing.
-
This paper describes different approaches for developing, for the first time, an automatic speech recognition system for two of the main dialects of Occitan, namely Gascon and Languedocian, and the results obtained with them. The difficulty of the task lies in the fact that Occitan is a less-resourced language. Although a great effort has been made to collect or create corpora of each variant (transcribed speech recordings for the acoustic models and two text corpora for the language models), the sizes of the corpora obtained are far from those of successful systems reported in the literature, and thus we have tested different techniques to compensate for the lack of resources. We have developed classical systems using Kaldi, creating an acoustic model for each variant and also creating language models from the collected corpora and from machine-translated texts. We have also tried fine-tuning a Whisper model with our speech corpora. We report word error rates of 20.86 for Gascon and 13.52 for Languedocian with the Kaldi systems and 16.37 for Gascon and 11.74 for Languedocian with Whisper.
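For readers unfamiliar with the metric, word error rate is conventionally the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and the recogniser's output, divided by the number of reference words. The sketch below shows that standard definition; it is not the evaluation code used in the paper, the function name is hypothetical, and it assumes the reported figures are percentages.

```python
def word_error_rate(reference, hypothesis):
    """Return the word error rate, in percent, of hypothesis against reference."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from output
            insertion = curr[j - 1] + 1  # extra word in output
            curr.append(min(substitution, deletion, insertion))
        prev = curr

    return 100.0 * prev[-1] / len(ref)
```

Because insertions count as errors, the value can exceed 100 when the hypothesis is much longer than the reference.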
-
Occitan is a less-resourced language and is classified as 'in danger' by UNESCO. It is therefore important to build resources and tools that can help to safeguard the language and develop its digitisation. CorpusArièja is a collection of 72 texts (just over 41,000 tokens) in the Occitan of the French department of Ariège. Most of the texts had to be digitised and processed with optical character recognition (OCR). The corpus contains dialectal and spelling variation, but is limited to prose, without diachronic or genre variation. It is annotated with two levels of lemmatisation, POS tags and verbal inflection. One of the main aims of the corpus is to enable the design of tools that can automatically annotate all Occitan texts, regardless of the dialect or spelling used. The Ariège territory is interesting because it exhibits both types of variation that we focus on, dialectal and spelling. It has many authors who write in their native language, their own variety of Occitan.
-
This paper presents Loflòc (Lexic obèrt flechit Occitan – Open Inflected Lexicon of Occitan), a morphological lexicon for Occitan. Even though the lexicon no longer occupies the same place in the NLP pipeline since the advent of large language models, it remains a crucial resource for low-resourced languages. Occitan is a Romance language spoken in the south of France and in parts of Italy and Spain. It is not recognized as an official language in France and no standard variety is shared across the area. To the best of our knowledge, Loflòc is the first publicly available lexicon for Occitan. It contains 650 thousand entries for 57 thousand lemmas. Each entry is accompanied by the corresponding Universal Dependencies Part-of-Speech tag. We show that the lexicon has solid coverage on the existing freely available corpora of Occitan in four major dialects. Coverage gaps on multi-dialect corpora are overwhelmingly driven by dialectal variation, which affects both open and closed classes. Based on this analysis we propose directions for future improvements.
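Lexicon coverage of a corpus is commonly measured as the share of corpus tokens whose form appears in the lexicon. The sketch below illustrates that general idea only; the function name, the lowercasing step, and the toy data in the usage note are assumptions, and the paper's actual coverage procedure may differ.

```python
def token_coverage(tokens, lexicon_forms):
    """Return (coverage ratio, sorted list of forms missing from the lexicon)."""
    if not tokens:
        raise ValueError("tokens must not be empty")
    # Assumption: matching is case-insensitive on the surface form.
    lowered = [token.lower() for token in tokens]
    known = sum(1 for form in lowered if form in lexicon_forms)
    missing = sorted({form for form in lowered if form not in lexicon_forms})
    return known / len(lowered), missing
```

For instance, `token_coverage(["Lo", "gat", "dormís"], {"lo", "gat"})` gives a ratio of two thirds and lists "dormís" as missing. Inspecting the missing forms is one simple way to see whether gaps come from dialectal or spelling variants.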
-
This volume brings together work carried out from 2019 to 2020 in connection with a research project of the CNRS Centre d'études franco-russes on the names of minority language varieties.
-
While existing neural network-based approaches have shown promising results in Handwritten Text Recognition (HTR) for high-resource languages and standardized/machine-written text, their application to low-resource languages often presents challenges, resulting in reduced effectiveness. In this paper, we propose an innovative HTR approach that leverages the Transformer architecture for recognizing handwritten Old Occitan language. Given the limited availability of data, which comprises only word pairs of graphical variants and lemmas, we develop and rely on elaborate data augmentation techniques for both text and image data. Our model combines a custom-trained Swin image encoder with a BERT text decoder, which we pre-train using a large-scale augmented synthetic data set and fine-tune on the small human-labeled data set. Experimental results reveal that our approach surpasses the performance of current state-of-the-art models for Old Occitan HTR, including open-source Transformer-based models such as a fine-tuned TrOCR and commercial applications like Google Cloud Vision. To nurture further research and development, we make our models, data sets, and code publicly available.
-
Occitan is a Romance language spoken in France and in small parts of Italy and Spain. It exhibits considerable written variation, both dialectal and orthographic. Taking this variation into account is a major challenge in equipping the language with tools. Automatic processing of Occitan has been developing over the last ten years. Resources and tools have been developed and are beginning to take dialectal variation into account. However, graphical variation is rarely taken into account. Our research focuses on automatically annotating a corpus of texts containing both types of variation with lemmas, parts of speech and verbal inflection. From this corpus we train automatic annotation tools that are robust to overall variation in Occitan.
-
Apertium translation pair for Occitan and French
-
This work presents recent contributions to the effort to provide Occitan with resources and tools for NLP. Several existing resources were modified or adapted, notably a rule-based tokenizer, a morphosyntactic lexicon and a treebank. These resources were used to train and evaluate neural models for lemmatization. As part of these experiments, a new, larger corpus (2 million tokens) drawn from Wikipedia was annotated with parts of speech, lemmatized and released.
-
We present lemmatization experiments on the unstandardized low-resourced languages Low Saxon and Occitan using two machine-learning-based approaches represented by MaChAmp and Stanza. We show different ways to increase training data by leveraging historical corpora, small amounts of gold data and dictionary information, and discuss the usefulness of this additional data. In the results, we find some differences in the performance of the models depending on the language. This variation is likely to be partly due to differences in the corpora we used, such as the amount of internal variation. However, we also observe common tendencies, for instance that sequential models trained only on gold-annotated data often yield the best overall performance and generalize better to unknown tokens.
-
This paper presents OcWikiDisc, a new freely available corpus in Occitan, as well as language identification experiments on Occitan done as part of the corpus building process. Occitan is a regional language spoken mainly in the south of France and in parts of Spain and Italy. It exhibits rich diatopic variation, it is not standardized, and it is still low-resourced, especially when it comes to large downloadable corpora. We introduce OcWikiDisc, a corpus extracted from the talk pages associated with the Occitan Wikipedia. The version of the corpus with the most restrictive language filtering contains 8K user messages for a total of 618K tokens. The language filtering is based on language identification experiments with five off-the-shelf tools, including the new fastText language identification model from Meta AI's No Language Left Behind initiative, released in July 2022.
-
Apertium linguistic data for Occitan
-
Cross-lingual word embeddings (CLWEs) have proven indispensable for various natural language processing tasks, e.g., bilingual lexicon induction (BLI). However, the lack of data often impairs the quality of representations. Various approaches requiring only weak cross-lingual supervision were proposed, but current methods still fail to learn good CLWEs for languages with only a small monolingual corpus. We therefore claim that it is necessary to explore further datasets to improve CLWEs in low-resource setups. In this paper we propose to incorporate data of related high-resource languages. In contrast to previous approaches which leverage independently pre-trained embeddings of languages, we (i) train CLWEs for the low-resource and a related language jointly and (ii) map them to the target language to build the final multilingual space. In our experiments we focus on Occitan, a low-resource Romance language which is often neglected due to lack of resources. We leverage data from French, Spanish and Catalan for training and evaluate on the Occitan-English BLI task. By incorporating supporting languages our method outperforms previous approaches by a large margin. Furthermore, our analysis shows that the degree of relatedness between an incorporated language and the low-resource language is critically important.
-
Occitan is a minority language spoken in Southern France, some Alpine valleys of Italy, and the Val d'Aran in Spain, for which language and speech technologies have only very recently started to be developed. This paper describes the first project for designing a text-to-speech synthesis system for one of its main regional varieties, namely Gascon. We used a state-of-the-art deep neural network approach, the Tacotron2-WaveGlow system. However, we faced two additional challenges: on the one hand, we wanted to test whether it was possible to obtain good-quality results with fewer recording hours than is usually reported for such systems; on the other hand, we needed to achieve a standard, non-Occitan pronunciation of French proper names, so we needed to record French words and test phoneme-based approaches. The evaluation carried out over the various developed systems and approaches shows promising results with near production-ready quality. It has also allowed us to identify the phenomena for which flaws or drops in quality occur, pointing to directions for future work to improve the quality of the current system and to build new systems for other language varieties and voices.
-
This paper outlines the ongoing effort of creating the first treebank for Occitan, a low-resourced regional language spoken mainly in the south of France. We briefly present the global context of the project and report on its current status. We adopt the Universal Dependencies framework for this project. Our methodology is based on two main principles. Firstly, in order to guarantee the annotation quality, we use the agile annotation approach. Secondly, we rely on pre-processing using existing tools (taggers and parsers) to facilitate the work of human annotators, mainly through a delexicalized cross-lingual parsing approach. We present the results available at this point (annotation guidelines and a sub-corpus annotated with PoS tags and lemmas) and give the timeline for the rest of the work.
-
Loflòc (Lexic obert flechit occitan, Open Inflected Lexicon of Occitan) is a computerized lexicon of inflected forms in Occitan. It was produced within the ANR project RESTAURE (Bernhard and Vergez-Couret, 2016) in collaboration with Lo Congrès Permanent de la Lenga Occitana. Creating a computerized lexicon for Occitan is part of a broader effort to build digital language resources for a language that currently has few of them. These resources, whether lexical like Loflòc or textual like BaTelÒc (Bras and Thomas, 2011; Bras and Vergez-Couret, 2016), are designed with a twofold objective: on the one hand, preserving and disseminating the linguistic heritage, and on the other, providing resources for developing natural language processing tools (for example, tools for information retrieval and extraction, or machine translation). They are created in line with the roadmap for Occitan digital development (Lo Congrès, 2014; Dazéas, 2015; Séguier and Mercadier, 2016). The objectives behind the creation of Loflòc are as follows:
• Provide Occitan with a structured lexicon of inflected forms suited to the needs of NLP (natural language processing), so that it can be integrated into applications such as a lemmatizer or a morphosyntactic analyser (Vergez-Couret and Urieli, 2015);
• Integrate the lexicon into a consultation interface;
• Use a standard morphosyntactic tagset;
• Accommodate all variation (dialectal, intra-dialectal, graphical) step by step.
Variation, whether dialectal, intra-dialectal or graphical, is present in Occitan writing, both old and contemporary. Automatic tools, just like speakers (new speakers, learners…), are confronted with all of this variation.
To build tools that are as robust as possible, this variation must be described and represented in lexicons. Moreover, in tools for consulting and querying the lexicon, users will be able to discover and better understand the full range of possible variation. To build the lexicon, we start by integrating existing resources in digital format, enriching them with grammatical information where it is incomplete or unsuitable, and completing the inflectional paradigms (gender and number…). The first resources integrated into Loflòc for Languedocian are the Dictionnaire Occitan-Français Languedocien by Laux (2001), the Dictionnaire Français-Occitan Languedocien by Laux (2005), and the data of the vèrb'Òc application, a conjugator published by Lo Congrès (Sauzet and Ubaud, 1995; Sauzet, 2016). Because Lo Congrès had normalized these resources in XML (TEI P5 standard), it was possible to extract the lemmas, their inflected forms and the necessary grammatical information automatically. For the structure of Loflòc and the choice of standards, we draw on French lexicons such as Morphalou (Romary et al., 2004) and GLÀFF (Sajous et al., 2013). We adopted the tags of the Eagles/Multext/Grace standard (Rajman et al., 1997), which we kept in English while adapting them to the specific features of Occitan. This will make it easier to compare our lexicon with those of closely related languages that have adopted similar, comparable tagsets (French, Catalan). In the talk, we will present the lexicon, its structure and its content, as well as the different types of application that cannot be developed without a lexicon of this kind (morphosyntactic analysers, syntactic parsers, machine translation systems, information retrieval tools, writing aids for texts or SMS, spell checkers, etc.).
References:
Bernhard, D., and Vergez-Couret, M. (2016). Le projet RESTAURE. In Les technologies pour les langues régionales de France, 82-90. Condé-sur-Noireau: DGLFLF.
Bras, M., and Thomas, J. (2011). « Batelòc : cap a una basa informatisada de tèxtes occitans », in A. Rieger & D. Sumien (eds.), Occitània convidada d’Euregio. Lièja 1981 - Aquisgran 2008 : Bilanç e amiras. Actes du Neuvième Congrès International de l’Association Internationale d’Études Occitanes, Aix-la-Chapelle, 24-31 août 2008. Aachen: Shaker.
Bras, M., and Vergez-Couret, M. (2016). « BaTelÒc: A text base for the Occitan language », in Vera Ferreira and Peter Bouda (eds.), Language Documentation and Conservation in Europe. Honolulu: University of Hawai'i Press, pp. 133-149.
Dazéas, B. (2015). Feuille de route pour le développement numérique occitan. In Actes de la Traitement Automatique des Langues Régionales de France et d’Europe, Caen.
Laux, C. (2001). Dictionnaire occitan-français : languedocien, avec la collab. de Serge Granier. Puylaurens: IEO, Section du Tarn.
Laux, C. (2005). Dictionnaire Français-Occitan. Castres: IEO del Tarn.
Lo Congrès (2014). Diagnostic e huelha de rota tau desvolopament numeric de la lenga occitana 2015-2019, rapòrt finau deu projècte. Media.kom, elhuyar. http://locongres.org/images/docs/huelha_rota_numeric_occitan_oc.pdf.
Rajman, M. (1997). Format de description lexicale pour le français – Partie 2 : description morphosyntaxique. Technical report, GRACE. http://www.limsi.fr/grace/.
Romary, L., Salmon-Alt, S., and Francopoulo, G. (2004). Standards going concrete: from LMF to Morphalou. Workshop on Electronic Dictionaries, Coling 2004, Geneva, Switzerland.
Sajous, F., Hathout, N., and Calderone, B. (2013). GLÀFF, un Gros Lexique À tout Faire du Français. Actes de la conférence Traitement Automatique des Langues Naturelles (TALN 2013).
Sauzet, P., and Ubaud, J. (1995). Le verbe occitan. Lo vèrb occitan. Aix-en-Provence: Édisud.
Sauzet, P. (2016). Conjugaison occitane. IEO edicions.
Séguier, A., and Mercadier, G. (2016). Le numérique au service de la transmission de la langue occitane : situation et perspectives de développement. In Les technologies pour les langues régionales de France, 82-90. Condé-sur-Noireau: DGLFLF.
Vergez-Couret, M., and Urieli, A. (2015). Analyse morphosyntaxique de l’occitan languedocien : l’amitié entre un petit languedocien et un gros catalan. In Actes du Workshop Traitement Automatique des Langues Régionales de France et d’Europe, Caen.