Résultats | Bibliographie COLaF

Bernhard, D., Vergez-Couret, M., & Dupuy, E. (2024). Au-delà des normes : identifier et documenter les langues minorisées pour le traitement automatique des langues. Cahiers Du Plurilinguisme Européen, 16. https://doi.org/10.57086/cpe.1710

Dieser Artikel stellt Überlegungen zu den Herausforderungen der Dokumentation von Minderheitensprachen im digitalen Raum an, ausgehend von den Arbeiten, die im Rahmen des DIVITAL-Projekts durchgeführt wurden. Die ersten Arbeiten des Projekts betrafen die Sammlung von Korpora und ihre Dokumentation durch feinkörnige Metadaten. Diese Arbeiten haben zwei große Herausforderungen aufgezeigt: (i) die Identifizierung der Sprachen und ihrer Varianten im Rahmen der Normen für die Kodierung von Sprachnamen und (ii) die Schaffung neuer Ressourcen in Verbindung mit der aktuellen Praxis dieser Sprachen.

Consulter le document

Lent, H., Tatariya, K., Dabre, R., Chen, Y., Fekete, M., Ploeger, E., Zhou, L., Armstrong, R.-A., Eijansantos, A., Malau, C., Heje, H. E., Lavrinovics, E., Kanojia, D., Belony, P., Bollmann, M., Grobol, L., Lhoneux, M. D., Hershcovich, D., DeGraff, M., … Bjerva, J. (2024). CreoleVal: Multilingual Multitask Benchmarks for Creoles. Transactions of the Association for Computational Linguistics, 12, 950–978. https://doi.org/10.1162/tacl_a_00682

Abstract Creoles represent an under-explored and marginalized group of languages, with few available resources for NLP research. While the genealogical ties between Creoles and a number of highly resourced languages imply a significant potential for transfer learning, this potential is hampered due to this lack of annotated data. In this work we present CreoleVal, a collection of benchmark datasets spanning 8 different NLP tasks, covering up to 28 Creole languages; it is an aggregate of novel development datasets for reading comprehension relation classification, and machine translation for Creoles, in addition to a practical gateway to a handful of preexisting benchmarks. For each benchmark, we conduct baseline experiments in a zero-shot setting in order to further ascertain the capabilities and limitations of transfer learning for Creoles. Ultimately, we see CreoleVal as an opportunity to empower research on Creoles in NLP and computational linguistics, and in general, a step towards more equitable language technology around the globe.

Consulter le document

Stosic, D., Marjanović, S., Bernhard, D., Bras, M., Kevers, L., Retali-Medori, S., Vergez-Couret, M., & Werner, C. (2024). The ParCoLab Parallel Corpus and Its Extension to Four Regional Languages of France. In N. Calzolari, M.-Y. Kan, V. Hoste, A. Lenci, S. Sakti, & N. Xue (Eds.), Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024) (pp. 16014–16023). ELRA and ICCL. https://aclanthology.org/2024.lrec-main.1392

Parallel corpora are still scarce for most of the world's language pairs. The situation is by no means different for regional languages of France. In addition, adequate web interfaces facilitate and encourage the use of parallel corpora by target users, such as language learners and teachers, as well as linguists. In this paper, we describe ParCoLab, a parallel corpus and a web platform for querying the corpus. From its onset, ParCoLab has been geared towards lower-resource languages, with an initial corpus in Serbian, along with French and English (later Spanish). We focus here on the extension of ParCoLab with a parallel corpus for four regional languages of France: Alsatian, Corsican, Occitan and Poitevin-Saintongeais. In particular, we detail criteria for choosing texts and issues related to their collection. The new parallel corpus contains more than 20k tokens per regional language.

Consulter le document

Mompelat, L., Dakota, D., & Kübler, S. (2022). How to Parse a Creole: When Martinican Creole Meets French. In N. Calzolari, C.-R. Huang, H. Kim, J. Pustejovsky, L. Wanner, K.-S. Choi, P.-M. Ryu, H.-H. Chen, L. Donatelli, H. Ji, S. Kurohashi, P. Paggio, N. Xue, S. Kim, Y. Hahm, Z. He, T. K. Lee, E. Santus, F. Bond, & S.-H. Na (Eds.), Proceedings of the 29th International Conference on Computational Linguistics (pp. 4397–4406). International Committee on Computational Linguistics. https://aclanthology.org/2022.coling-1.387/

We investigate methods to develop a parser for Martinican Creole, a highly under-resourced language, using a French treebank. We compare transfer learning and multi-task learning models and examine different input features and strategies to handle the massive size imbalance between the treebanks. Surprisingly, we find that a simple concatenated (French + Martinican Creole) baseline yields optimal results even though it has access to only 80 Martinican Creole sentences. POS embeddings work better than lexical ones, but they suffer from negative transfer.

Consulter sur aclanthology.org

Lent, H., Bugliarello, E., de Lhoneux, M., Qiu, C., & Søgaard, A. (2021). On Language Models for Creoles. In A. Bisazza & O. Abend (Eds.), Proceedings of the 25th Conference on Computational Natural Language Learning (pp. 58–71). Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.conll-1.5

Creole languages such as Nigerian Pidgin English and Haitian Creole are under-resourced and largely ignored in the NLP literature. Creoles typically result from the fusion of a foreign language with multiple local languages, and what grammatical and lexical features are transferred to the creole is a complex process. While creoles are generally stable, the prominence of some features may be much stronger with certain demographics or in some linguistic situations. This paper makes several contributions: We collect existing corpora and release models for Haitian Creole, Nigerian Pidgin English, and Singaporean Colloquial English. We evaluate these models on intrinsic and extrinsic tasks. Motivated by the above literature, we compare standard language models with distributionally robust ones and find that, somewhat surprisingly, the standard language models are superior to the distributionally robust ones. We investigate whether this is an effect of over-parameterization or relative distributional stability, and find that the difference persists in the absence of over-parameterization, and that drift is limited, confirming the relative stability of creole languages.

Consulter sur aclanthology.org

Suárez, P. J. O., Sagot, B., & Romary, L. (2019). Asynchronous pipelines for processing huge corpora on medium to low resource infrastructures. In P. Bański, A. Barbaresi, H. Biber, E. Breiteneder, S. Clematide, M. Kupietz, H. Lüngen, & C. Iliadi (Eds.), Proceedings of the Workshop on Challenges in the Management of Large Corpora (CMLC-7). (pp. 9–16). Leibniz-Institut für Deutsche Sprache. https://doi.org/10.14618/ids-pub-9021

Consulter sur nbn-resolving.de

Bernhard, D., Ligozat, A.-L., Martin, F., Bras, M., Magistry, P., Vergez-Couret, M., Steiblé, L., Erhart, P., Hathout, N., Huck, D., Rey, C., Reynés, P., Rosset, S., Sibille, J., & Lavergne, T. (2018, May). Corpora with Part-of-Speech Annotations for Three Regional Languages of France: Alsatian, Occitan and Picard. 11th Edition of the Language Resources and Evaluation Conference. https://hal.science/hal-01704806

This article describes the creation of corpora with part-of-speech annotations for three regional languages of France: Alsatian, Occitan and Picard. These manual annotations were performed in the context of the RESTAURE project, whose goal is to develop resources and tools for these under-resourced French regional languages. The article presents the tagsets used in the annotation process as well as the resulting annotated corpora.

Consulter le document

Grouas, T., Mapelli, V., & Samier, Q. (2016). Review on the Existing Language Resources for Languages of France. In N. Calzolari, K. Choukri, T. Declerck, S. Goggi, M. Grobelnik, B. Maegaard, J. Mariani, H. Mazo, A. Moreno, J. Odijk, & S. Piperidis (Eds.), Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16) (pp. 4539–4542). European Language Resources Association (ELRA). https://aclanthology.org/L16-1719

With the support of the DGLFLF, ELDA conducted an inventory of existing language resources for the regional languages of France. The main aim of this inventory was to assess the exploitability of the identified resources within technologies. A total of 2,299 Language Resources were identified. As a second step, a deeper analysis of a set of three language groups (Breton, Occitan, overseas languages) was carried out along with a focus of their exploitability within three technologies: automatic translation, voice recognition/synthesis and spell checkers. The survey was followed by the organisation of the TLRF2015 Conference which aimed to present the state of the art in the field of the Technologies for Regional Languages of France. The next step will be to activate the network of specialists built up during the TLRF conference and to begin the organisation of a second TLRF conference. Meanwhile, the French Ministry of Culture continues its actions related to linguistic diversity and technology, in particular through a project with Wikimedia France related to contributions to Wikipedia in regional languages, the upcoming new version of the “Corpus de la Parole” and the reinforcement of the DGLFLF's Observatory of Linguistic Practices.

Consulter sur aclanthology.org

Leixa, J., Mapelli, V., & Choukri, K. (2014). Inventaire des ressources linguistiques des langues de France. ELDA. https://www.culture.gouv.fr/Thematiques/Langue-francaise-et-langues-de-France/Agir-pour-les-langues/Innover-dans-le-domaine-des-langues-et-du-numerique/Accompagner-les-technologies-de-la-langue-et-la-normalisation/Inventaire-des-ressources-linguistiques-des-langues-de-France

Consulter sur www.culture.gouv.fr

Votre recherche

Résultats 9 ressources

Explorer

Corpus

Langue

Tâche

Type de papier