Complete bibliography: 164 resources
-
Anaouder mouezh e Brezhoneg gant Vosk [Speech recognition in Breton with Vosk]
-
Large language models (LLMs) have demonstrated remarkable capabilities in natural language processing, yet their effectiveness in handling historical languages remains largely unexplored. This study examines the performance of open-source LLMs in part-of-speech (POS) tagging for Old Occitan, a historical language characterized by non-standardized orthography and significant diachronic variation. Through comparative analysis of two distinct corpora (hagiographical and medical texts), we evaluate how current models handle the inherent challenges of processing a low-resource historical language. Our findings demonstrate critical limitations in LLM performance when confronted with extreme orthographic and syntactic variability. We provide detailed error analysis and specific recommendations for improving model performance in historical language processing. This research advances our understanding of LLM capabilities in challenging linguistic contexts while offering practical insights for both computational linguistics and historical language studies.
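As a rough illustration of the kind of zero-shot setup such a study evaluates, the sketch below prompts an open-source instruction-tuned LLM for POS tags; the model name, prompt wording, and tag scheme are assumptions for illustration, not the paper's actual configuration.

# Illustrative sketch only: zero-shot POS tagging by prompting an open-source LLM.
# The model choice, prompt, and UD tag scheme are assumptions, not the study's setup.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="mistralai/Mistral-7B-Instruct-v0.2",  # assumed open-source LLM
)

def pos_tag(sentence: str) -> str:
    prompt = (
        "Tag each token of the following Old Occitan sentence with its "
        "Universal Dependencies part-of-speech tag, as token/TAG pairs.\n"
        f"Sentence: {sentence}\n"
        "Tags:"
    )
    output = generator(prompt, max_new_tokens=128, do_sample=False)
    # The text-generation pipeline returns the prompt plus the continuation.
    return output[0]["generated_text"][len(prompt):].strip()

# print(pos_tag("..."))  # an Old Occitan sentence would go here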
-
Although Dagur and Alsatian belong to two typologically distant language families, they share several similarities: both languages are endangered, lack a unified orthographic system, and have few digital corpora available. Given these challenges, the main goal of this article is to compare the noise in the corpora of these two languages and its impact on annotation and part-of-speech (POS) tagging. We first discuss strategies that can be used to reduce the noise caused by the orthographic inconsistencies observed during corpus collection, using Dagur as an example. We then observe that the distributions of POS trigrams in the manually annotated Dagur and Alsatian corpora are similar to those of typologically related languages in UD v2.12, which justifies experimenting with zero-shot transfer approaches for POS tagging. We evaluate a few simple noise-reduction strategies for POS tagging using the example of the Alsatian dialects and relying on their proximity to Standard German. The results confirm the important role of linguistic proximity in POS tagging and the effectiveness of the data transformation method we propose. However, they also call for a more thorough interpretation of the capabilities of multilingual models.
-
This article looks at the challenges of documenting minority languages in the digital environment, based on work carried out as part of the DIVITAL project. The project's initial work involved collecting corpora and documenting them using fine-grained metadata. This work has highlighted two major challenges: (i) the identification of languages and their variants, within the framework of standards for the codification of language names, and (ii) the creation of new resources linked to the current practices of these languages.
-
Our goal is to develop an automatic speech recognition (ASR) system for regional languages. To this end, we explore the specialization or adaptation of Whisper through fine-tuning. In this article, we report on work in progress on two languages: Basque and Alsatian.
-
This poster explores the challenges of syntactic annotation for Alsatian, a low-resource language, by comparing two novel approaches. On the one hand, we examine the use of generative large language models (LLMs), such as ChatGPT or Mistral, which promise broad but potentially superficial linguistic coverage. On the other hand, we study lighter encoder-type models trained specifically on languages closely related to Alsatian. Our analysis highlights the strengths and weaknesses of each method, examining their efficiency and their ability to capture the subtleties of Alsatian syntax. The goal is to determine whether the "wunderbàr" technology of LLMs crushes the competition, or whether the more modest models, fed on the "neural sauerkraut" of neighbouring languages, can compete to tame Alsatian grammar. This research thus aims to open new perspectives for the syntactic annotation of low-resource languages and to contribute to the development of better-performing linguistic tools for Alsatian. Get ready to witness an epic battle between AI models to conquer Alsatian syntax!
-
Creoles represent an under-explored and marginalized group of languages, with few available resources for NLP research. While the genealogical ties between Creoles and a number of highly resourced languages imply a significant potential for transfer learning, this potential is hampered due to this lack of annotated data. In this work we present CreoleVal, a collection of benchmark datasets spanning 8 different NLP tasks, covering up to 28 Creole languages; it is an aggregate of novel development datasets for reading comprehension, relation classification, and machine translation for Creoles, in addition to a practical gateway to a handful of preexisting benchmarks. For each benchmark, we conduct baseline experiments in a zero-shot setting in order to further ascertain the capabilities and limitations of transfer learning for Creoles. Ultimately, we see CreoleVal as an opportunity to empower research on Creoles in NLP and computational linguistics, and in general, a step towards more equitable language technology around the globe.
-
Word classes are linguistic categories serving as the basis for the description of the vocabulary and grammar of natural languages. While important publications are regularly devoted to their definition, identification, and classification, in the field of Romance linguistics we lack a comprehensive, state-of-the-art overview of the current research. This Manual offers an updated and detailed discussion of all relevant aspects related to word classes in the Romance languages. In the first part, word classes are discussed from both a theoretical and historical point of view. The second part of the volume takes as its point of departure single word classes, described transversally in all the main Romance languages, while the third observes the relevant word classes from the point of view of specific Romance(-based) varieties. The fourth part explores Romance word classes at the interface of grammar and other fields of research. The Manual is intended as a reference work for all scholars and students interested in the description of both the standard, major Romance languages and the smaller, lesser described Romance(-based) varieties.
-
Whisper and other large-scale automatic speech recognition models have made significant progress in performance. However, their performance on many low-resource languages, such as Kazakh, is not satisfactory. It is worth researching how to utilize low-cost data to improve the performance of Whisper on under-represented languages. In this study, we utilized easily accessible unpaired speech and text data and combined the language model GPT with Whisper on Kazakh. We implemented end of transcript (EOT) judgment modification and hallucination penalty to improve the performance of speech recognition. Further, we employed the decoding average token log probability as a criterion to select samples from unlabeled speech data and used pseudo-labeled data to fine-tune the model to further improve its performance. Ultimately, we achieved more than 10% absolute WER reduction in multiple experiments, and the whole process has the potential to be generalized to other under-represented languages.
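The selection criterion mentioned above can be pictured with a small, hypothetical sketch using the openai-whisper package, whose transcription output exposes a per-segment avg_logprob; the model size, threshold, and file names are illustrative assumptions, not the paper's settings.

# Hedged sketch: selecting pseudo-labelled samples by average token log probability,
# in the spirit of the criterion described above. Model size, threshold, and file
# names are illustrative assumptions.
import whisper

model = whisper.load_model("small")   # assumed model size
THRESHOLD = -0.5                      # assumed confidence threshold

def pseudo_label(audio_paths):
    selected = []
    for path in audio_paths:
        result = model.transcribe(path, language="kk")   # Kazakh
        segments = result.get("segments", [])
        if not segments:
            continue
        # Average the per-segment average token log probabilities.
        avg_logprob = sum(s["avg_logprob"] for s in segments) / len(segments)
        if avg_logprob > THRESHOLD:
            selected.append((path, result["text"]))
    return selected

# pairs = pseudo_label(["clip_001.wav", "clip_002.wav"])
# `pairs` could then serve as pseudo-labelled data for fine-tuning.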
-
Whether or not several Creole languages which developed during the early modern period can be considered genetic descendants of European languages has been the subject of intense debate. This is in large part due to the absence of evidence of intermediate forms. This work introduces a new open corpus, the Molyé corpus, which combines stereotypical representations of three kinds of language variation in Europe with early attestations of French-based Creole languages across a period of 400 years. It is intended to facilitate future research on the continuity between contact situations in Europe and Creolophone (former) colonies.
-
The Manual offers the first comprehensive overview on Occitan and Occitan linguistics, including the latest research. With 26 contributions organized in seven parts, it covers diachronic and synchronic language description, diatopic variation and sociolinguistics, language planning and equipment, as well as teaching and language pedagogy.
-
Everlasting contact between language communities leads to constant changes in languages over time, and gives rise to language varieties and dialects. However, the communities speaking non-standard language are often overlooked by non-inclusive NLP technologies. Recently, there has been a surge of interest in studying diatopic and diachronic changes in dialect NLP, but there is currently no research exploring the intersection of both. Our work aims to fill this gap by systematically reviewing diachronic and diatopic papers from a unified perspective. In this work, we critically assess nine tasks and datasets across five dialects from three language families (Slavic, Romance, and Germanic) in both spoken and written modalities. The tasks covered are diverse, including corpus construction, dialect distance estimation, and dialect geolocation prediction, among others. Moreover, we outline five open challenges regarding changes in dialect use over time, the reliability of dialect datasets, the importance of speaker characteristics, limited coverage of dialects, and ethical considerations in data collection. We hope that our work sheds light on future research towards inclusive computational methods and datasets for language varieties and dialects.
-
A Python programme to tokenise texts in Occitan based on rules. To launch the programme, execute the following instruction: python3 tokenizer_occitan.py < input.txt > output.conllu. The script takes as input a text file with a single sentence per line, each line starting with a sentence ID, followed by a tab character, followed by the sentence itself. The current version of the tool was developed during the projects DIVITAL (funded by the ANR) and CorCoDial (funded by the Academy of Finland). An illustrative input/output sketch is given below.
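The expected input/output contract can be illustrated with a minimal sketch; the naive whitespace-and-punctuation splitting below is only a hypothetical stand-in for the tool's actual Occitan-specific rules.

# Illustrative sketch of the expected I/O only: one "sent_id<TAB>sentence" line in,
# CoNLL-U out. The real tokenizer_occitan.py applies Occitan-specific rules;
# this naive split is a hypothetical stand-in.
import re
import sys

def naive_tokenize(sentence):
    # Split on whitespace, then detach punctuation as separate tokens.
    return re.findall(r"\w+|[^\w\s]", sentence, re.UNICODE)

for line in sys.stdin:
    line = line.rstrip("\n")
    if not line:
        continue
    sent_id, _, sentence = line.partition("\t")
    print(f"# sent_id = {sent_id}")
    print(f"# text = {sentence}")
    for i, form in enumerate(naive_tokenize(sentence), start=1):
        # CoNLL-U columns: ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
        print("\t".join([str(i), form] + ["_"] * 8))
    print()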
-
Self-supervised learning (SSL) representations from massively multilingual models offer a promising solution for low-resource language speech tasks. Despite advancements, language adaptation in TTS systems remains an open problem. This paper explores the language adaptation capability of ZMM-TTS, a recent SSL-based multilingual TTS system proposed in our previous work. We conducted experiments on 12 languages using limited data with various fine-tuning configurations. We demonstrate that the similarity in phonetics between the pre-training and target languages, as well as the language category, affects the target language's adaptation performance. Additionally, we find that the fine-tuning dataset size and number of speakers influence adaptability. Surprisingly, we also observed that using paired data for fine-tuning is not always optimal compared to audio-only data. Beyond speech intelligibility, our analysis covers speaker similarity, language identification, and predicted MOS.
-
In this work, we take on the challenging task of building a single text-to-speech synthesis system that is capable of generating speech in over 7000 languages, many of which lack sufficient data for traditional TTS development. By leveraging a novel integration of massively multilingual pretraining and meta learning to approximate language representations, our approach enables zero-shot speech synthesis in languages without any available data. We validate our system's performance through objective measures and human evaluation across a diverse linguistic landscape. By releasing our code and models publicly, we aim to empower communities with limited linguistic resources and foster further innovation in the field of speech technology.
-
We propose a new paradigm for machine translation that is particularly useful for no-resource languages (those without any publicly available bilingual or monolingual corpora): LLM-RBMT (LLM-Assisted Rule Based Machine Translation). Using the LLM-RBMT paradigm, we design the first language education/revitalization-oriented machine translator for Owens Valley Paiute (OVP), a critically endangered Indigenous American language for which there is virtually no publicly available data. We present a detailed evaluation of the translator's components: a rule-based sentence builder, an OVP to English translator, and an English to OVP translator. We also discuss the potential of the paradigm, its limitations, and the many avenues for future research that it opens up.
-
Effectively normalizing spellings in textual data poses a considerable challenge, especially for low-resource languages lacking standardized writing systems. In this study, we fine-tuned a multilingual model with data from several Occitan dialects and conducted a series of experiments to assess the model's representations of these dialects. For evaluation purposes, we compiled a parallel lexicon encompassing four Occitan dialects. Intrinsic evaluations of the model's embeddings revealed that surface similarity between the dialects strengthened representations. When the model was further fine-tuned for part-of-speech tagging, its performance was robust to dialectical variation, even when trained solely on part-of-speech data from a single dialect. Our findings suggest that large multilingual models minimize the need for spelling normalization during pre-processing.
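As a sketch of the kind of intrinsic comparison described, one might compare a multilingual encoder's embeddings of two dialectal spellings via cosine similarity; the model name and placeholder spellings below are assumptions, not the paper's materials.

# Hedged sketch: comparing a multilingual encoder's representations of two
# dialectal spellings of the same word via cosine similarity. The model name
# and the word pair are illustrative assumptions.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "xlm-roberta-base"   # assumed multilingual model
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)
model.eval()

def embed(word: str) -> torch.Tensor:
    # Mean-pool the last hidden states over subword tokens.
    inputs = tokenizer(word, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state
    return hidden.mean(dim=1).squeeze(0)

def similarity(a: str, b: str) -> float:
    return torch.nn.functional.cosine_similarity(embed(a), embed(b), dim=0).item()

# Hypothetical dialectal spelling pair (placeholders, not attested forms):
# print(similarity("spelling_variant_a", "spelling_variant_b"))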
-
In this paper we present Matignon-LSF, the first dataset of interpreted French Sign Language (LSF) and one of the largest LSF datasets available for research to date. It is a dataset of LSF interpreted live during public speeches by the French government. The dataset comprises 39 hours of LSF videos with French language audio and corresponding subtitles. In addition to this data, we offer pre-computed video features (I3D). We provide a detailed analysis of the proposed dataset as well as some experimental results to demonstrate the interest of this novel dataset.
-
This article presents a new bilingual dataset in written French and French Sign Language (LSF), called STK LSF. This corpus is currently being produced as part of the ANR SignToKids project. The aim of this corpus is to provide digital educational tools for deaf children, thereby facilitating the joint learning of LSF and written French. More broadly, it is intended to support future studies on the automatic processing of signed languages. To define this corpus, we focused on several grammatical phenomena typical of LSF, as well as on tales usually studied by hearing children in the second cycle of schooling in France. The corpus data represent approximately 1 hour of recording, carried out with a motion capture (MoCap) system offering a spatial precision of less than 1 mm and a temporal resolution of 240 Hz. This high level of precision guarantees the quality of the data collected, which will be used both to build pedagogical scenarios in French and LSF, including signing avatar videos, and for automatic translation of text into LSF.
Explore
Corpus
- French Sign Language (4)
- Speech (4)
- Text (21)
- Annotated (9)
  - Morphology (6)
  - Parallel (2)
  - Syntax (1)
- Web (7)
Language
- Alsatian (12)
- Breton (9)
- Corsican (5)
- Creoles (3)
- French (4)
- Guyane (1)
- Multilingual (15)
- COLaF languages (9)
- Occitan (36)
- Picard (7)
- Poitevin-Saintongeais (3)
Task / Paper type
- Language classification (9)
- State of the art (3)
- Inventory (2)
- Normalization (3)
- COLaF papers (2)
- Position paper (12)
- Project (6)