Complete bibliography | COLaF Bibliography

DG, G. (2025). gweltou/anaouder-gui. https://github.com/gweltou/anaouder-gui (Original work published 2024)

Align text with Breton speech, create subtitles

gweltou/breton-tts · Hugging Face. (2025, May 9). https://huggingface.co/gweltou/breton-tts

DG, G. (2025). gweltou/anaouder-cli. https://github.com/gweltou/anaouder-cli (Original work published 2022)

Speech recognition in Breton with Vosk

Schöffel, M., Wiedner, M., Arias, E. G., Ruppert, P., Heumann, C., & Aßenmacher, M. (2025). Modern Models, Medieval Texts: A POS Tagging Study of Old Occitan (No. arXiv:2503.07827). arXiv. https://doi.org/10.48550/arXiv.2503.07827

Large language models (LLMs) have demonstrated remarkable capabilities in natural language processing, yet their effectiveness in handling historical languages remains largely unexplored. This study examines the performance of open-source LLMs in part-of-speech (POS) tagging for Old Occitan, a historical language characterized by non-standardized orthography and significant diachronic variation. Through comparative analysis of two distinct corpora (hagiographical and medical texts), we evaluate how current models handle the inherent challenges of processing a low-resource historical language. Our findings demonstrate critical limitations in LLM performance when confronted with extreme orthographic and syntactic variability. We provide detailed error analysis and specific recommendations for improving model performance in historical language processing. This research advances our understanding of LLM capabilities in challenging linguistic contexts while offering practical insights for both computational linguistics and historical language studies.

Bernhard, D., & Dolińska, J. (2025). Managing Noise in Part-of-Speech Tagging for Extremely Low-Resource Languages: Comparing Strategies for Corpus Collection and Annotation in Dagur and Alsatian. Corpus, 26. https://doi.org/10.4000/1364t

Although Dagur and Alsatian belong to two typologically distant language families, they share several similarities: both languages are endangered, lack a unified orthographic system, and have few digital corpora available. Given these challenges, the main objective of this article is to compare the noise in corpora of these two languages and its impact on part-of-speech (POS) annotation and tagging. We first discuss strategies that can be used to reduce the noise caused by orthographic inconsistencies observed during corpus collection, using Dagur as an example. We then observe that the distributions of POS trigrams in the manually annotated Dagur and Alsatian corpora are similar to those of typologically related languages in UD v2.12, which justifies experimenting with zero-shot transfer approaches for POS tagging. We evaluate a few simple noise-reduction strategies for POS tagging, using the Alsatian dialects as an example and relying on their proximity to Standard German. The results confirm the important role of linguistic proximity in POS tagging and the effectiveness of the data transformation method we propose. However, they also call for further interpretation of the capabilities of multilingual models.

Bernhard, D., Vergez-Couret, M., & Dupuy, E. (2024). Au-delà des normes : identifier et documenter les langues minorisées pour le traitement automatique des langues. Cahiers Du Plurilinguisme Européen, 16. https://doi.org/10.57086/cpe.1710

This article reflects on the challenges of documenting minority languages in the digital space, based on work carried out within the DIVITAL project. The project's initial work concerned the collection of corpora and their documentation through fine-grained metadata. This work revealed two major challenges: (i) identifying languages and their variants within the standards for coding language names, and (ii) creating new resources in connection with the current practice of these languages.

Bigeard, S., Tsolakis, P., Vincent, E., Colotte, V., Erhart, P., & Ouni, S. (2024, November). Retour d’expérience : Whisper pour les langues régionales. LIFT 2: Journées Scientifiques Du GdR Linguistique Informatique, Formelle et de Terrain. https://hal.science/hal-04787239

Our goal is to develop an automatic speech recognition (ASR) system for regional languages. To this end, we explore specializing or adapting Whisper through fine-tuning. In this article, we report on work in progress in two languages: Basque and Alsatian.

Bernhard, D., Binot, J., & Werner, C. (2024, October). Mistral sur les Vosges : L’IA souffle-t-elle dans la bonne direction pour l’alsacien ? https://hal.science/hal-04869156

This poster explores the challenges of syntactic annotation for Alsatian, a low-resource language, by comparing two novel approaches. On the one hand, we examine the use of generative large language models (LLMs), such as ChatGPT or Mistral, which promise broad but potentially superficial linguistic coverage. On the other, we study lighter encoder-type models trained specifically on languages close to Alsatian. Our analysis highlights the strengths and weaknesses of each method, examining their effectiveness and their ability to capture the subtleties of Alsatian syntax. The aim is to determine whether the "wunderbàr" technology of LLMs crushes the competition, or whether more modest models, fed on the "neural sauerkraut" of neighboring languages, can compete in taming Alsatian grammar. This research thus aims to open new perspectives for the syntactic annotation of low-resource languages and to contribute to the development of better-performing linguistic tools for Alsatian. Get ready to witness an epic battle between AI models to conquer Alsatian syntax!

Lent, H., Tatariya, K., Dabre, R., Chen, Y., Fekete, M., Ploeger, E., Zhou, L., Armstrong, R.-A., Eijansantos, A., Malau, C., Heje, H. E., Lavrinovics, E., Kanojia, D., Belony, P., Bollmann, M., Grobol, L., Lhoneux, M. D., Hershcovich, D., DeGraff, M., … Bjerva, J. (2024). CreoleVal: Multilingual Multitask Benchmarks for Creoles. Transactions of the Association for Computational Linguistics, 12, 950–978. https://doi.org/10.1162/tacl_a_00682

Creoles represent an under-explored and marginalized group of languages, with few available resources for NLP research. While the genealogical ties between Creoles and a number of highly resourced languages imply a significant potential for transfer learning, this potential is hampered due to this lack of annotated data. In this work we present CreoleVal, a collection of benchmark datasets spanning 8 different NLP tasks, covering up to 28 Creole languages; it is an aggregate of novel development datasets for reading comprehension, relation classification, and machine translation for Creoles, in addition to a practical gateway to a handful of preexisting benchmarks. For each benchmark, we conduct baseline experiments in a zero-shot setting in order to further ascertain the capabilities and limitations of transfer learning for Creoles. Ultimately, we see CreoleVal as an opportunity to empower research on Creoles in NLP and computational linguistics, and in general, a step towards more equitable language technology around the globe.

Manual of Romance Word Classes. (2024). De Gruyter. https://doi.org/10.1515/9783110746389

Word classes are linguistic categories serving as basis in the description of the vocabulary and grammar of natural languages. While important publications are regularly devoted to their definition, identification, and classification, in the field of Romance linguistics we lack a comprehensive, state-of-the-art overview of the current research. This Manual offers an updated and detailed discussion of all relevant aspects related to word classes in the Romance languages. In the first part, word classes are discussed from both a theoretical and historical point of view. The second part of the volume takes as its point of departure single word classes, described transversally in all the main Romance languages, while the third observes the relevant word classes from the point of view of specific Romance(-based) varieties. The fourth part explores Romance word classes at the interface of grammar and other fields of research. The Manual is intended as a reference work for all scholars and students interested in the description of both the standard, major Romance languages and the smaller, lesser described Romance(-based) varieties.

Li, J., Pu, Y., Sun, Q., & Zhang, W.-Q. (2024). Improving Whisper’s Recognition Performance for Under-Represented Language Kazakh Leveraging Unpaired Speech and Text (No. arXiv:2408.05554). arXiv. https://doi.org/10.48550/arXiv.2408.05554

Whisper and other large-scale automatic speech recognition models have made significant progress in performance. However, their performance on many low-resource languages, such as Kazakh, is not satisfactory. It is worth researching how to utilize low-cost data to improve the performance of Whisper on under-represented languages. In this study, we utilized easily accessible unpaired speech and text data and combined the language model GPT with Whisper on Kazakh. We implemented end of transcript (EOT) judgment modification and hallucination penalty to improve the performance of speech recognition. Further, we employed the decoding average token log probability as a criterion to select samples from unlabeled speech data and used pseudo-labeled data to fine-tune the model to further improve its performance. Ultimately, we achieved more than 10% absolute WER reduction in multiple experiments, and the whole process has the potential to be generalized to other under-represented languages.
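The selection criterion mentioned in this abstract (average token log probability of the decoded transcript) can be illustrated with a minimal sketch. This is not the authors' implementation: the `Hypothesis` class, the function names, and the threshold value of -0.5 are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Hypothesis:
    """A decoded transcript for one unlabeled audio clip (hypothetical structure)."""
    audio_id: str
    text: str
    token_logprobs: Sequence[float]  # log probability of each decoded token


def average_logprob(hyp: Hypothesis) -> float:
    """Mean log probability per token; empty hypotheses get the worst possible score."""
    if not hyp.token_logprobs:
        return float("-inf")
    return sum(hyp.token_logprobs) / len(hyp.token_logprobs)


def select_pseudo_labels(
    hyps: Sequence[Hypothesis],
    min_avg_logprob: float = -0.5,  # assumed threshold, not taken from the paper
) -> List[Hypothesis]:
    """Keep only hypotheses whose average token log probability clears the threshold."""
    return [h for h in hyps if average_logprob(h) >= min_avg_logprob]
```

The retained transcripts would then serve as pseudo-labels for fine-tuning; in practice the threshold would presumably be tuned on held-out data.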

Dent, R., Janès, J., Clérice, T., Suarez, P. O., & Sagot, B. (2024). Molyé: A Corpus-based Approach to Language Contact in Colonial France (No. arXiv:2408.04554). arXiv. https://doi.org/10.48550/arXiv.2408.04554

Whether or not several Creole languages which developed during the early modern period can be considered genetic descendants of European languages has been the subject of intense debate. This is in large part due to the absence of evidence of intermediate forms. This work introduces a new open corpus, the Molyé corpus, which combines stereotypical representations of three kinds of language variation in Europe with early attestations of French-based Creole languages across a period of 400 years. It is intended to facilitate future research on the continuity between contact situations in Europe and Creolophone (former) colonies.

Esher, L., & Sibille, J. (2024). Manuel de linguistique occitane. De Gruyter. https://doi.org/10.1515/9783110733433

The Manual offers the first comprehensive overview of Occitan and Occitan linguistics, including the latest research. With 26 contributions organized in seven parts, it covers diachronic and synchronic language description, diatopic variation and sociolinguistics, language planning and equipment, as well as teaching and language pedagogy.

Çelikkol, M., Körber, L., & Zhao, W. (2024). Exploring Diachronic and Diatopic Changes in Dialect Continua: Tasks, Datasets and Challenges (No. arXiv:2407.04010). arXiv. https://doi.org/10.48550/arXiv.2407.04010

Everlasting contact between language communities leads to constant changes in languages over time, and gives rise to language varieties and dialects. However, the communities speaking non-standard language are often overlooked by non-inclusive NLP technologies. Recently, there has been a surge of interest in studying diatopic and diachronic changes in dialect NLP, but there is currently no research exploring the intersection of both. Our work aims to fill this gap by systematically reviewing diachronic and diatopic papers from a unified perspective. In this work, we critically assess nine tasks and datasets across five dialects from three language families (Slavic, Romance, and Germanic) in both spoken and written modalities. The tasks covered are diverse, including corpus construction, dialect distance estimation, and dialect geolocation prediction, among others. Moreover, we outline five open challenges regarding changes in dialect use over time, the reliability of dialect datasets, the importance of speaker characteristics, limited coverage of dialects, and ethical considerations in data collection. We hope that our work sheds light on future research towards inclusive computational methods and datasets for language varieties and dialects.

Vergez-Couret, M., & Miletic, A. (2024). Tokenization for Occitan (Gascon and Lengadocian). Zenodo. https://doi.org/10.5281/zenodo.12515136

A rule-based Python program to tokenise texts in Occitan. To launch the program, run the following command:
python3 tokenizer_occitan.py < input.txt > output.conllu
The script takes as input a text file with a single sentence per line; each line starts with a sentence ID, followed by a tab character, followed by the sentence itself. The current version of the tool was developed during the projects DIVITAL (funded by the ANR) and CorCoDial (funded by the Academy of Finland).
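The input format described above (sentence ID, tab, sentence, one per line) can be produced with a small helper. The following is a minimal sketch, not part of the tool itself: the function name `format_tokenizer_input` and the `sent-N` ID scheme are assumptions, since the description does not say what the IDs should look like.

```python
from typing import Iterable


def format_tokenizer_input(sentences: Iterable[str], prefix: str = "sent") -> str:
    """Build one 'ID<TAB>sentence' line per non-empty sentence.

    The ID scheme (prefix plus running number) is a hypothetical choice.
    """
    lines = []
    for sentence in sentences:
        # Collapse tabs, newlines and repeated spaces so that the only tab
        # on each line is the one separating the ID from the sentence.
        cleaned = " ".join(sentence.split())
        if not cleaned:
            continue  # skip empty sentences instead of emitting a bare ID
        lines.append(f"{prefix}-{len(lines) + 1}\t{cleaned}")
    return "".join(line + "\n" for line in lines)


if __name__ == "__main__":
    # Placeholder sentences; replace with real Occitan text.
    text = format_tokenizer_input(["first sentence here", "second sentence here"])
    with open("input.txt", "w", encoding="utf-8") as handle:
        handle.write(text)
```

The resulting input.txt could then be passed to the command quoted in the description.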

Gong, C., Cooper, E., Wang, X., Qiang, C., Geng, M., Wells, D., Wang, L., Dang, J., Tessier, M., Pine, A., Richmond, K., & Yamagishi, J. (2024). An Initial Investigation of Language Adaptation for TTS Systems under Low-resource Scenarios (No. arXiv:2406.08911). arXiv. https://doi.org/10.48550/arXiv.2406.08911

Self-supervised learning (SSL) representations from massively multilingual models offer a promising solution for low-resource language speech tasks. Despite advancements, language adaptation in TTS systems remains an open problem. This paper explores the language adaptation capability of ZMM-TTS, a recent SSL-based multilingual TTS system proposed in our previous work. We conducted experiments on 12 languages using limited data with various fine-tuning configurations. We demonstrate that the similarity in phonetics between the pre-training and target languages, as well as the language category, affects the target language's adaptation performance. Additionally, we find that the fine-tuning dataset size and number of speakers influence adaptability. Surprisingly, we also observed that using paired data for fine-tuning is not always optimal compared to audio-only data. Beyond speech intelligibility, our analysis covers speaker similarity, language identification, and predicted MOS.

Kocmi, T., Zouhar, V., Federmann, C., & Post, M. (2024). Navigating the Metrics Maze: Reconciling Score Magnitudes and Accuracies (No. arXiv:2401.06760). arXiv. https://doi.org/10.48550/arXiv.2401.06760

Ten years ago, a single metric, BLEU, governed progress in machine translation research. For better or worse, there is no such consensus today, and consequently it is difficult for researchers to develop and retain intuitions about metric deltas that drove earlier research and deployment decisions. This paper investigates the “dynamic range” of a number of modern metrics in an effort to provide a collective understanding of the meaning of differences in scores both within and among metrics; in other words, we ask what point difference x in metric y is required between two systems for humans to notice? We conduct our evaluation on a new large dataset, ToShip23, using it to discover deltas at which metrics achieve system-level differences that are meaningful to humans, which we measure by pairwise system accuracy. We additionally show that this method of establishing delta-accuracy is more stable than the standard use of statistical p-values in regards to testset size. Where data size permits, we also explore the effect of metric deltas and accuracy across finer-grained features such as translation direction, domain, and system closeness.
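Pairwise system accuracy, the measure this abstract relies on, is commonly computed as the share of system pairs that a metric ranks in the same order as human judgments. The sketch below assumes that definition; the function name and the example scores are hypothetical, and ties are counted as disagreements, which is one possible convention.

```python
from itertools import combinations
from typing import Dict


def pairwise_accuracy(metric_scores: Dict[str, float],
                      human_scores: Dict[str, float]) -> float:
    """Fraction of system pairs where metric and human score differences share a sign."""
    systems = sorted(metric_scores.keys() & human_scores.keys())
    pairs = list(combinations(systems, 2))
    if not pairs:
        raise ValueError("need at least two systems scored by both metric and humans")
    agreements = 0
    for first, second in pairs:
        metric_delta = metric_scores[first] - metric_scores[second]
        human_delta = human_scores[first] - human_scores[second]
        # A positive product means both deltas point the same way;
        # a zero product (a tie on either side) counts as a disagreement.
        if metric_delta * human_delta > 0:
            agreements += 1
    return agreements / len(pairs)


if __name__ == "__main__":
    # Made-up scores for three hypothetical systems.
    metric = {"sysA": 30.0, "sysB": 28.0, "sysC": 25.0}
    human = {"sysA": 0.8, "sysB": 0.9, "sysC": 0.5}
    print(pairwise_accuracy(metric, human))
```

In this made-up example the metric and the human scores disagree only on the pair sysA/sysB, so two of the three pairs agree.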

Lux, F., Meyer, S., Behringer, L., Zalkow, F., Do, P., Coler, M., Habets, E. A. P., & Vu, N. T. (2024, June 10). Meta Learning Text-to-Speech Synthesis in over 7000 Languages. arXiv. https://arxiv.org/abs/2406.06403v1

In this work, we take on the challenging task of building a single text-to-speech synthesis system that is capable of generating speech in over 7000 languages, many of which lack sufficient data for traditional TTS development. By leveraging a novel integration of massively multilingual pretraining and meta learning to approximate language representations, our approach enables zero-shot speech synthesis in languages without any available data. We validate our system's performance through objective measures and human evaluation across a diverse linguistic landscape. By releasing our code and models publicly, we aim to empower communities with limited linguistic resources and foster further innovation in the field of speech technology.

Coleman, J., Krishnamachari, B., Rosales, R., & Iskarous, K. (2024). LLM-Assisted Rule Based Machine Translation for Low/No-Resource Languages. In M. Mager, A. Ebrahimi, S. Rijhwani, A. Oncevay, L. Chiruzzo, R. Pugh, & K. von der Wense (Eds.), Proceedings of the 4th Workshop on Natural Language Processing for Indigenous Languages of the Americas (AmericasNLP 2024) (pp. 67–87). Association for Computational Linguistics. https://doi.org/10.18653/v1/2024.americasnlp-1.9

We propose a new paradigm for machine translation that is particularly useful for no-resource languages (those without any publicly available bilingual or monolingual corpora): LLM-RBMT (LLM-Assisted Rule Based Machine Translation). Using the LLM-RBMT paradigm, we design the first language education/revitalization-oriented machine translator for Owens Valley Paiute (OVP), a critically endangered Indigenous American language for which there is virtually no publicly available data. We present a detailed evaluation of the translator's components: a rule-based sentence builder, an OVP to English translator, and an English to OVP translator. We also discuss the potential of the paradigm, its limitations, and the many avenues for future research that it opens up.

Hopton, Z., & Aepli, N. (2024). Modeling Orthographic Variation in Occitan’s Dialects. In Y. Scherrer, T. Jauhiainen, N. Ljubešić, M. Zampieri, P. Nakov, & J. Tiedemann (Eds.), Proceedings of the Eleventh Workshop on NLP for Similar Languages, Varieties, and Dialects (VarDial 2024) (pp. 78–88). Association for Computational Linguistics. https://doi.org/10.18653/v1/2024.vardial-1.6

Effectively normalizing spellings in textual data poses a considerable challenge, especially for low-resource languages lacking standardized writing systems. In this study, we fine-tuned a multilingual model with data from several Occitan dialects and conducted a series of experiments to assess the model's representations of these dialects. For evaluation purposes, we compiled a parallel lexicon encompassing four Occitan dialects. Intrinsic evaluations of the model's embeddings revealed that surface similarity between the dialects strengthened representations. When the model was further fine-tuned for part-of-speech tagging, its performance was robust to dialectical variation, even when trained solely on part-of-speech data from a single dialect. Our findings suggest that large multilingual models minimize the need for spelling normalization during pre-processing.
