Rechercher
Bibliographie complète 164 ressources
-
This paper describes a method of semi-automatic word spotting in minority languages, from one and the same Aesop fable “The North Wind and the Sun” translated in Romance languages/dialects from Hexagonal (i.e. Metropolitan) France and languages from French Polynesia. The first task consisted of finding out how a dozen words such as “wind” and “sun” were translated in over 200 versions collected in the field — taking advantage of orthographic similarity, word position and context. Occurrences of the translations were then extracted from the phone-aligned recordings. The results were judged accurate in 96–97% of cases, both on the development corpus and a test set of unseen data. Corrected alignments were then mapped and basemaps were drawn to make various linguistic phenomena immediately visible. The paper exemplifies how regular expressions may be used for this purpose. The final result, which takes the form of an online speaking atlas (enriching the https://atlas.limsi.fr website), enables us to illustrate lexical, morphological or phonetic variation.
-
Machine translation has been researched using deep neural networks in recent years. These networks require lots of data to learn abstract representations of the input stored in continuous vectors. Dialect translation has become more important since the advent of social media. In particular, when dialect speakers and standard language speakers no longer understand each other, machine translation is of rising concern. Usually, dialect translation is a typical low-resourced language setting facing data scarcity problems. Additionally, spelling inconsistencies due to varying pronunciations and the lack of spelling rules complicate translation. This paper presents the best-performing approaches to handle these problems for Alemannic dialects. The results show that back-translation and conditioning on dialectal manifestations achieve the most remarkable enhancement over the baseline. Using back-translation, a significant gain of +4.5 over the strong transformer baseline of 37.3 BLEU points is accomplished. Differentiating between several Alemannic dialects instead of treating Alemannic as one dialect leads to substantial improvements: Multi-dialectal translation surpasses the baseline on the dialectal test sets. However, training individual models outperforms the multi-dialectal approach. There, improvements range from 7.5 to 10.6 BLEU points over the baseline depending on the dialect.
-
Identifying phone inventories is a crucial component in language documentation and the preservation of endangered languages. However, even the largest collection of phone inventory only covers about 2000 languages, which is only 1/4 of the total number of languages in the world. A majority of the remaining languages are endangered. In this work, we attempt to solve this problem by estimating the phone inventory for any language listed in Glottolog, which contains phylogenetic information regarding 8000 languages. In particular, we propose one probabilistic model and one non-probabilistic model, both using phylogenetic trees (“language family trees”) to measure the distance between languages. We show that our best model outperforms baseline models by 6.5 F1. Furthermore, we demonstrate that, with the proposed inventories, the phone recognition model can be customized for every language in the set, which improved the PER (phone error rate) in phone recognition by 25%.
-
Poète, pédagogue, lexicographe, fabuliste et traducteur, Hector Poullet est un acteur majeur de la scène locale et un incorrigible polygraphe. A un âge où d’autres se retirent, il se voit aujourd’hui encore sollicité pour enseigner la langue ou pour assurer des conférences savantes. L’examen de sa bibliographie accuse un compte de trente-cinq ouvrages, publiés seul ou en équipe, dont on trouve ici un relevé. A cette production respectable, il faut ajouter nombre de préfaces, de traductions, de chapitres de livres, d’articles inclus dans des sommes collectives ou sur divers sites. Suivant la proposition de la Direction des Affaires Culturelles, nous avons conçu un film d’entretien, où l’on entend cette personnalité évoquer les épisodes significatifs de l’histoire récente de la Guadeloupe. La présente contribution ne pouvait se concevoir ni comme le verbatim de ce dialogue, ni comme le simple script du film. Nous choisissons donc d’éclairer deux épisodes mal connus de son action dans l’entrée du créole au collège et dans l’élaboration du premier dictionnaire du guadeloupéen. Itinérance et diversité Né en 1938, Hector aime rappeler la diversité de ses origines et le nomadisme de son enfance, avec un père originaire de Grande Terre et une mère de Basse Terre. Il vit une enfance de fils d’instituteur itinérant, balloté au gré des affectations parentales, aux quatre coins de l’archipel. Le bac en poche, il va à Paris pour poursuivre des études de sciences. Au fil d’un séjour qui s
-
Grapheme-to-Phoneme (G2P) has many applications in NLP and speech fields. Most existing work focuses heavily on languages with abundant training datasets, which limits the scope of target languages to less than 100 languages. This work attempts to apply zero-shot learning to approximate G2P models for all low-resource and endangered languages in Glottolog (about 8k languages). For any unseen target language, we first build the phylogenetic tree (i.e. language family tree) to identify top-k nearest languages for which we have training sets. Then we run models of those languages to obtain a hypothesis set, which we combine into a confusion network to propose a most likely hypothesis as an approximation to the target language. We test our approach on over 600 unseen languages and demonstrate it significantly outperforms baselines.
-
Languages are classified as low-resource when they lack the quantity of data necessary for training statistical and machine learning tools and models. Causes of resource scarcity vary but can include poor access to technology for developing these resources, a relatively small population of speakers, or a lack of urgency for collecting such resources in bilingual populations where the second language is high-resource. As a result, the languages described as low-resource in the literature are as different as Finnish on the one hand, with millions of speakers using it in every imaginable domain, and Seneca, with only a small-handful of fluent speakers using the language primarily in a restricted domain. While issues stemming from the lack of resources necessary to train models unite this disparate group of languages, many other issues cut across the divide between widely-spoken low-resource languages and endangered languages. In this position paper, we discuss the unique technological, cultural, practical, and ethical challenges that researchers and indigenous speech community members face when working together to develop language technology to support endangered language documentation and revitalization. We report the perspectives of language teachers, Master Speakers and elders from indigenous communities, as well as the point of view of academics. We describe an ongoing fruitful collaboration and make recommendations for future partnerships between academic researchers and language community stakeholders.
-
Apertium linguistic data for Occitan
-
L’Alsace peut être qualifiée de province au même titre que la Bourgogne ou la Franche-Comté, par exemple ; il y aura peu de désaccord à propos d’une telle dénomination. Mais quand il s’agit de définir la situation linguistique de l’Alsace, des difficultés nombreuses surgissent qui sont liées au fait qu’il ne s’agit pas seulement de repérer des langues en usage par rapport à une classification des langues fondée sur des critères scientifiques mais qu’il s’agit aussi de les situer les unes par ...
-
In this work, we investigate methods for the challenging task of translating between low- resource language pairs that exhibit some level of similarity. In particular, we consider the utility of transfer learning for translating between several Indo-European low-resource languages from the Germanic and Romance language families. In particular, we build two main classes of transfer-based systems to study how relatedness can benefit the translation performance. The primary system fine-tunes a model pre-trained on a related language pair and the contrastive system fine-tunes one pre-trained on an unrelated language pair. Our experiments show that although relatedness is not necessary for transfer learning to work, it does benefit model performance.
-
Creole languages such as Nigerian Pidgin English and Haitian Creole are under-resourced and largely ignored in the NLP literature. Creoles typically result from the fusion of a foreign language with multiple local languages, and what grammatical and lexical features are transferred to the creole is a complex process. While creoles are generally stable, the prominence of some features may be much stronger with certain demographics or in some linguistic situations. This paper makes several contributions: We collect existing corpora and release models for Haitian Creole, Nigerian Pidgin English, and Singaporean Colloquial English. We evaluate these models on intrinsic and extrinsic tasks. Motivated by the above literature, we compare standard language models with distributionally robust ones and find that, somewhat surprisingly, the standard language models are superior to the distributionally robust ones. We investigate whether this is an effect of over-parameterization or relative distributional stability, and find that the difference persists in the absence of over-parameterization, and that drift is limited, confirming the relative stability of creole languages.
-
Cross-lingual word embeddings (CLWEs) have proven indispensable for various natural language processing tasks, e.g., bilingual lexicon induction (BLI). However, the lack of data often impairs the quality of representations. Various approaches requiring only weak cross-lingual supervision were proposed, but current methods still fail to learn good CLWEs for languages with only a small monolingual corpus. We therefore claim that it is necessary to explore further datasets to improve CLWEs in low-resource setups. In this paper we propose to incorporate data of related high-resource languages. In contrast to previous approaches which leverage independently pre-trained embeddings of languages, we (i) train CLWEs for the low-resource and a related language jointly and (ii) map them to the target language to build the final multilingual space. In our experiments we focus on Occitan, a low-resource Romance language which is often neglected due to lack of resources. We leverage data from French, Spanish and Catalan for training and evaluate on the Occitan-English BLI task. By incorporating supporting languages our method outperforms previous approaches by a large margin. Furthermore, our analysis shows that the degree of relatedness between an incorporated language and the low-resource language is critically important.
-
After generations of exploitation, Indigenous people often respond negatively to the idea that their languages are data ready for the taking. By treating Indigenous knowledge as a commodity, speech and language technologists risk disenfranchising local knowledge authorities, reenacting the causes of language endangerment. Scholars in related fields have responded to calls for decolonisation, and we in the speech and language technology community need to follow suit, and explore what this means for our practices that involve Indigenous languages and the communities who own them. This paper reviews colonising discourses in speech and language technology, and suggests new ways of working with Indigenous communities, and seeks to open a discussion of a postcolonial approach to computational methods for supporting language vitality.
-
Large text corpora are increasingly important for a wide variety of Natural Language Processing (NLP) tasks, and automatic language identification (LangID) is a core technology needed to collect such datasets in a multilingual context. LangID is largely treated as solved in the literature, with models reported that achieve over 90% average F1 on as many as 1,366 languages. We train LangID models on up to 1,629 languages with comparable quality on held-out test sets, but find that human-judged LangID accuracy for web-crawl text corpora created using these models is only around 5% for many lower-resource languages, suggesting a need for more robust evaluation. Further analysis revealed a variety of error modes, arising from domain mismatch, class imbalance, language similarity, and insufficiently expressive models. We propose two classes of techniques to mitigate these errors: wordlist-based tunable-precision filters (for which we release curated lists in about 500 languages) and transformer-based semi-supervised LangID models, which increase median dataset precision from 5.5% to 71.2%. These techniques enable us to create an initial data set covering 100K or more relatively clean sentences in each of 500+ languages, paving the way towards a 1,000-language web text corpus.
-
Les langues d’outre-mer sont régies en droit français tant par les dispositions générales relatives aux langues régionales que par les dispositions spécifiques relatives au statut des territoires dans lesquels elles sont en usage. Mais aucun de ces régimes juridiques ne leur offre une protection efficace. La notion de « langues de France » apparue il y a vingt ans n'a pas produit beaucoup d'effets. Pourrait-elle devenir une catégorie juridique ouvrant la voie à la reconnaissance de droits linguistiques ? Outre-mer, ces droits linguistiques seraient des préalables nécessaires à l’exercice effectif d’autres droits fondamentaux (notamment en matière d’éducation, de santé, de justice).
Explorer
Corpus
- Langue des signes française (4)
- Parole (4)
-
Texte
(21)
-
Annotated
(9)
- Morphology (6)
- Parallel (2)
- Syntax (1)
- Web (7)
-
Annotated
(9)
Langue
- Alsacien (12)
- Breton (9)
- Corse (5)
- Créoles (3)
- Français (4)
- Guyane (1)
-
Multilingue
(15)
- Langues COLaF (9)
- Occitan (36)
- Picard (7)
- Poitevin-Saintongeais (3)
Tâche
Type de papier
- Classification des langues (9)
- Etat de l'art (3)
- Inventaire (2)
- Normalisation (3)
- Papiers COLaF (2)
- Prise de position (12)
- Projet (6)