Full bibliography: 142 resources
-
In the minds of most French people, the so-called regional languages are mere "patois": crude distortions of French, vague idioms fit only for describing banalities. Why should they be troubled by their disappearance? Yet, as every linguist knows, Basque, Breton, Alsatian, Corsican, Picard and the rest have nothing to envy French, English, Arabic or Mandarin. The only difference between the "small languages" and the others is that the former never had the good fortune to become the official language of a state. This book has an openly stated ambition: to reconcile France with its diversity, so that French remains our common language without becoming our only language.
-
This paper presents OcWikiDisc, a new freely available corpus in Occitan, as well as language identification experiments on Occitan done as part of the corpus building process. Occitan is a regional language spoken mainly in the south of France and in parts of Spain and Italy. It exhibits rich diatopic variation, it is not standardized, and it is still low-resourced, especially when it comes to large downloadable corpora. We introduce OcWikiDisc, a corpus extracted from the talk pages associated with the Occitan Wikipedia. The version of the corpus with the most restrictive language filtering contains 8K user messages for a total of 618K tokens. The language filtering is based on language identification experiments with five off-the-shelf tools, including the new fastText language identification model from Meta AI's No Language Left Behind initiative, released in July 2022.
-
Driven by the goal of eradicating language barriers on a global scale, machine translation has solidified itself as a key focus of artificial intelligence research today. However, such efforts have coalesced around a small subset of languages, leaving behind the vast majority of mostly low-resource languages. What does it take to break the 200 language barrier while ensuring safe, high quality results, all while keeping ethical considerations in mind? In No Language Left Behind, we took on this challenge by first contextualizing the need for low-resource language translation support through exploratory interviews with native speakers. Then, we created datasets and models aimed at narrowing the performance gap between low and high-resource languages. More specifically, we developed a conditional compute model based on Sparsely Gated Mixture of Experts that is trained on data obtained with novel and effective data mining techniques tailored for low-resource languages. We propose multiple architectural and training improvements to counteract overfitting while training on thousands of tasks. Critically, we evaluated the performance of over 40,000 different translation directions using a human-translated benchmark, Flores-200, and combined human evaluation with a novel toxicity benchmark covering all languages in Flores-200 to assess translation safety. Our model achieves an improvement of 44% BLEU relative to the previous state-of-the-art, laying important groundwork towards realizing a universal translation system. Finally, we open source all contributions described in this work, accessible at https://github.com/facebookresearch/fairseq/tree/nllb.
-
This paper describes a method of semi-automatic word spotting in minority languages, from one and the same Aesop fable “The North Wind and the Sun” translated in Romance languages/dialects from Hexagonal (i.e. Metropolitan) France and languages from French Polynesia. The first task consisted of finding out how a dozen words such as “wind” and “sun” were translated in over 200 versions collected in the field — taking advantage of orthographic similarity, word position and context. Occurrences of the translations were then extracted from the phone-aligned recordings. The results were judged accurate in 96–97% of cases, both on the development corpus and a test set of unseen data. Corrected alignments were then mapped and basemaps were drawn to make various linguistic phenomena immediately visible. The paper exemplifies how regular expressions may be used for this purpose. The final result, which takes the form of an online speaking atlas (enriching the https://atlas.limsi.fr website), enables us to illustrate lexical, morphological or phonetic variation.
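The use of regular expressions for word spotting mentioned in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the pattern, the function name `spot_word` and the two sample sentences are assumptions chosen for illustration, and the sketch relies only on orthographic similarity and word position, leaving out context and the phone-aligned audio.

```python
import re

# Hypothetical pattern for Romance cognates of "sun" (soleil, solelh, sorelh, ...).
# The spelling classes are illustrative assumptions, not taken from the paper.
SUN_PATTERN = re.compile(r"s[ou]+[lr][eèé]\w*", re.IGNORECASE)


def spot_word(text, pattern):
    """Return (token_index, token) pairs for tokens that fully match the pattern."""
    tokens = re.findall(r"\w+", text)
    return [(i, tok) for i, tok in enumerate(tokens) if pattern.fullmatch(tok)]


# Illustrative openings of the fable in French and in one Occitan spelling.
french = "La bise et le soleil se disputaient"
occitan = "La bisa e lo solelh se disputavan"

french_hits = spot_word(french, SUN_PATTERN)
occitan_hits = spot_word(occitan, SUN_PATTERN)
print(french_hits, occitan_hits)
```

Because both versions translate the same sentence, the spotted word sits at the same token index in each, which is the kind of positional cue the abstract alludes to. A pattern this loose would also match unrelated words, so a real pipeline would need manual checking of the hits.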
-
Machine translation has been researched using deep neural networks in recent years. These networks require lots of data to learn abstract representations of the input stored in continuous vectors. Dialect translation has become more important since the advent of social media. In particular, when dialect speakers and standard language speakers no longer understand each other, machine translation is of rising concern. Usually, dialect translation is a typical low-resourced language setting facing data scarcity problems. Additionally, spelling inconsistencies due to varying pronunciations and the lack of spelling rules complicate translation. This paper presents the best-performing approaches to handle these problems for Alemannic dialects. The results show that back-translation and conditioning on dialectal manifestations achieve the most remarkable enhancement over the baseline. Using back-translation, a significant gain of +4.5 over the strong transformer baseline of 37.3 BLEU points is accomplished. Differentiating between several Alemannic dialects instead of treating Alemannic as one dialect leads to substantial improvements: Multi-dialectal translation surpasses the baseline on the dialectal test sets. However, training individual models outperforms the multi-dialectal approach. There, improvements range from 7.5 to 10.6 BLEU points over the baseline depending on the dialect.
-
Poet, teacher, lexicographer, fabulist and translator, Hector Poullet is a major figure on the local scene and an incorrigibly prolific writer. At an age when others retire, he is still called upon to teach the language or to give scholarly lectures. His bibliography runs to thirty-five works, published alone or with co-authors, which are listed here. To this respectable output must be added many prefaces, translations, book chapters, and articles included in collective volumes or published on various websites. At the suggestion of the Direction des Affaires Culturelles, we made an interview film in which he recalls the significant episodes of Guadeloupe's recent history. The present contribution could be conceived neither as a verbatim transcript of that dialogue nor as the mere script of the film. We have therefore chosen to shed light on two little-known episodes of his work: the introduction of Creole in lower secondary school (collège) and the compilation of the first dictionary of Guadeloupean Creole. Itinerancy and diversity. Born in 1938, Hector likes to recall the diversity of his origins and the nomadism of his childhood, with a father from Grande Terre and a mother from Basse Terre. He spent his childhood as the son of an itinerant schoolteacher, shuttled from one parental posting to the next across the four corners of the archipelago. With his baccalauréat in hand, he went to Paris to study science. Over the course of a stay that ...
-
Languages are classified as low-resource when they lack the quantity of data necessary for training statistical and machine learning tools and models. Causes of resource scarcity vary but can include poor access to technology for developing these resources, a relatively small population of speakers, or a lack of urgency for collecting such resources in bilingual populations where the second language is high-resource. As a result, the languages described as low-resource in the literature are as different as Finnish on the one hand, with millions of speakers using it in every imaginable domain, and Seneca, with only a small handful of fluent speakers using the language primarily in a restricted domain. While issues stemming from the lack of resources necessary to train models unite this disparate group of languages, many other issues cut across the divide between widely-spoken low-resource languages and endangered languages. In this position paper, we discuss the unique technological, cultural, practical, and ethical challenges that researchers and indigenous speech community members face when working together to develop language technology to support endangered language documentation and revitalization. We report the perspectives of language teachers, Master Speakers and elders from indigenous communities, as well as the point of view of academics. We describe an ongoing fruitful collaboration and make recommendations for future partnerships between academic researchers and language community stakeholders.
-
Apertium linguistic data for Occitan
-
Alsace can be described as a province in the same way as Burgundy or Franche-Comté, for example; few would dispute such a label. But when it comes to defining the linguistic situation of Alsace, many difficulties arise, because the task is not only to identify the languages in use against a classification of languages based on scientific criteria, but also to situate them relative ...
-
In this work, we investigate methods for the challenging task of translating between low-resource language pairs that exhibit some level of similarity. In particular, we consider the utility of transfer learning for translating between several Indo-European low-resource languages from the Germanic and Romance language families. We build two main classes of transfer-based systems to study how relatedness can benefit the translation performance. The primary system fine-tunes a model pre-trained on a related language pair and the contrastive system fine-tunes one pre-trained on an unrelated language pair. Our experiments show that although relatedness is not necessary for transfer learning to work, it does benefit model performance.
-
Cross-lingual word embeddings (CLWEs) have proven indispensable for various natural language processing tasks, e.g., bilingual lexicon induction (BLI). However, the lack of data often impairs the quality of representations. Various approaches requiring only weak cross-lingual supervision were proposed, but current methods still fail to learn good CLWEs for languages with only a small monolingual corpus. We therefore claim that it is necessary to explore further datasets to improve CLWEs in low-resource setups. In this paper we propose to incorporate data of related high-resource languages. In contrast to previous approaches which leverage independently pre-trained embeddings of languages, we (i) train CLWEs for the low-resource and a related language jointly and (ii) map them to the target language to build the final multilingual space. In our experiments we focus on Occitan, a low-resource Romance language which is often neglected due to lack of resources. We leverage data from French, Spanish and Catalan for training and evaluate on the Occitan-English BLI task. By incorporating supporting languages our method outperforms previous approaches by a large margin. Furthermore, our analysis shows that the degree of relatedness between an incorporated language and the low-resource language is critically important.
-
Large text corpora are increasingly important for a wide variety of Natural Language Processing (NLP) tasks, and automatic language identification (LangID) is a core technology needed to collect such datasets in a multilingual context. LangID is largely treated as solved in the literature, with models reported that achieve over 90% average F1 on as many as 1,366 languages. We train LangID models on up to 1,629 languages with comparable quality on held-out test sets, but find that human-judged LangID accuracy for web-crawl text corpora created using these models is only around 5% for many lower-resource languages, suggesting a need for more robust evaluation. Further analysis revealed a variety of error modes, arising from domain mismatch, class imbalance, language similarity, and insufficiently expressive models. We propose two classes of techniques to mitigate these errors: wordlist-based tunable-precision filters (for which we release curated lists in about 500 languages) and transformer-based semi-supervised LangID models, which increase median dataset precision from 5.5% to 71.2%. These techniques enable us to create an initial data set covering 100K or more relatively clean sentences in each of 500+ languages, paving the way towards a 1,000-language web text corpus.
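The idea of a wordlist-based, tunable-precision filter can be sketched as follows. This is a minimal sketch under stated assumptions, not the filter released with the paper: the tiny Occitan wordlist, the scoring rule (share of tokens found in the wordlist) and the names `wordlist_score` and `filter_sentences` are all hypothetical.

```python
import re

# Tiny illustrative wordlist (assumption): a few common Occitan function words.
OCCITAN_WORDS = {"lo", "la", "los", "las", "e", "que", "de", "es", "un", "una", "dins", "amb"}


def wordlist_score(sentence, wordlist):
    """Fraction of the sentence's tokens that appear in the wordlist."""
    tokens = re.findall(r"\w+", sentence.lower())
    if not tokens:
        return 0.0
    return sum(tok in wordlist for tok in tokens) / len(tokens)


def filter_sentences(sentences, wordlist, threshold):
    """Keep sentences whose wordlist score reaches the threshold.

    Raising the threshold trades recall for precision.
    """
    return [s for s in sentences if wordlist_score(s, wordlist) >= threshold]


candidates = ["Lo gat es dins la cosina", "The cat is in the kitchen"]
kept_loose = filter_sentences(candidates, OCCITAN_WORDS, threshold=0.5)
kept_strict = filter_sentences(candidates, OCCITAN_WORDS, threshold=0.8)
print(kept_loose, kept_strict)
```

The threshold is the tunable part: at 0.5 the Occitan sentence is kept (four of its six tokens are in the list), while at 0.8 nothing passes. A curated list of words distinctive to the target language would matter far more in practice than this toy list suggests.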
-
Overseas languages are governed in French law both by the general provisions on regional languages and by the specific provisions on the status of the territories in which they are used. But neither of these legal regimes offers them effective protection. The notion of "langues de France" (languages of France), which emerged twenty years ago, has had little effect. Could it become a legal category paving the way for the recognition of linguistic rights? In the overseas territories, these linguistic rights would be necessary preconditions for the effective exercise of other fundamental rights (notably in education, health and justice).
-
Occitan is a minority language spoken in Southern France, some Alpine Valleys of Italy, and the Val d'Aran in Spain, which only very recently started developing language and speech technologies. This paper describes the first project for designing a Text-to-Speech synthesis system for one of its main regional varieties, namely Gascon. We used a state-of-the-art deep neural network approach, the Tacotron2-WaveGlow system. However, we faced two additional challenges: on the one hand, we wanted to test if it was possible to obtain good quality results with fewer recording hours than is usually reported for such systems; on the other hand, we needed to achieve a standard, non-Occitan pronunciation of French proper names, so we needed to record French words and test phoneme-based approaches. The evaluation carried out over the various developed systems and approaches shows promising results with near production-ready quality. It has also allowed us to detect the phenomena for which some flaws or drops in quality occur, pointing to directions for future work to improve the quality of the current system and to build new systems for other language varieties and voices.