Résultats | Bibliographie COLaF

Coleman, J., Krishnamachari, B., Rosales, R., & Iskarous, K. (2024). LLM-Assisted Rule Based Machine Translation for Low/No-Resource Languages. In M. Mager, A. Ebrahimi, S. Rijhwani, A. Oncevay, L. Chiruzzo, R. Pugh, & K. von der Wense (Eds.), Proceedings of the 4th Workshop on Natural Language Processing for Indigenous Languages of the Americas (AmericasNLP 2024) (pp. 67–87). Association for Computational Linguistics. https://doi.org/10.18653/v1/2024.americasnlp-1.9

We propose a new paradigm for machine translation that is particularly useful for no-resource languages (those without any publicly available bilingual or monolingual corpora): LLM-RBMT (LLM-Assisted Rule Based Machine Translation). Using the LLM-RBMT paradigm, we design the first language education/revitalization-oriented machine translator for Owens Valley Paiute (OVP), a critically endangered Indigenous American language for which there is virtually no publicly available data. We present a detailed evaluation of the translator's components: a rule-based sentence builder, an OVP to English translator, and an English to OVP translator. We also discuss the potential of the paradigm, its limitations, and the many avenues for future research that it opens up.

Consulter le document

Grobol, L., & Jouitteau, M. (2024, May). ARBRES Kenstur: a Breton-French Parallel Corpus Rooted in Field Linguistics. LREC-COLING 2024 - The 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation. https://hal.science/hal-04551941

ARBRES is an ongoing project of open science implemented as a platform (“wikigrammar”) documenting both the Breton language itself and the state of research and engineering work in linguistics and NLP. Along its nearly 15 years of operation, it has aggregated a wealth of linguistic data in the form of interlinear glosses with translations illustrating lexical items, grammatical features, dialectal variations… While these glosses were primarily meant for human consumption, their volume and the regular format imposed by the wiki engine used for the website also make them suitable for machine processing. ARBRES Kenstur is a new parallel corpus derived from the glosses in ARBRES, including about 5k phrases and sentences in Breton along with translations in standard French. The nature of the original data — sourced from field linguistic inquiries meant to document the structure of Breton — leads to a resource that is mechanically more concerned with the internal variations of the language and rare phenomena than typical parallel corpora. Preliminaries experiments in using this corpus show that it can help improve machine translation for Breton, demonstrating that sourcing data from field linguistic documentation can be a way to help provide NLP tools for minority and low-resource languages.

Consulter le document

Soto, W., Parmentier, Y., & Gardent, C. (2023). Phylogeny-Inspired Soft Prompts For Data-to-Text Generation in Low-Resource Languages. In Y. Arase, B. Hu, & W. Lu (Eds.), IJCNLP-AACL 2023: The 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics. ACL. https://hal.science/hal-04199557

Most work on verbalising Knowledge-Graphs (KG) has focused on high-resource languages such as English, Russian, Czech or Arabic. In this paper, we focus on KG-to-Text generation where the output text is in Breton, Irish or Welsh. To overcome the small size of the parallel training data, we combine the strengths of a multilingual encoder-decoder model with denoising fine-tuning on monolingual data and Soft Prompt fine-tuning on a small quantity of KG/text data. We furthermore structure the soft prompt into multiple sub-prompts designed to capture the similarities and differences between English, Knowledge graphs and the three target languages. Our experiments show that our approach outperforms strong baselines and that all sub-prompts contribute to performance.

Consulter le document

apertium/apertium-oci-fra. (2023). Apertium. https://github.com/apertium/apertium-oci-fra (Original work published 2018)

Apertium translation pair for Occitan and French

Consulter sur github.com

Scherrer, Y., Kuparinen, O., & Miletic, A. (2023). CorCoDial – Machine translation techniques for corpus-based computational dialectology. Proceedings of the 24th Annual Conference of the European Association for Machine Translation, 511–512. https://events.tuni.fi/uploads/2023/06/11678752-proceedings-eamt2023.pdf

Consulter le document

apertium/apertium-br-fr. (2023). Apertium. https://github.com/apertium/apertium-br-fr (Original work published 2018)

Apertium translation pair for Breton and French

Consulter sur github.com

NLLB Team, Costa-jussà, M. R., Cross, J., Çelebi, O., Elbayad, M., Heafield, K., Heffernan, K., Kalbassi, E., Lam, J., Licht, D., Maillard, J., Sun, A., Wang, S., Wenzek, G., Youngblood, A., Akula, B., Barrault, L., Gonzalez, G. M., Hansanti, P., … Wang, J. (2022). No Language Left Behind: Scaling Human-Centered Machine Translation (No. arXiv:2207.04672). arXiv. https://doi.org/10.48550/arXiv.2207.04672

Driven by the goal of eradicating language barriers on a global scale, machine translation has solidified itself as a key focus of artificial intelligence research today. However, such efforts have coalesced around a small subset of languages, leaving behind the vast majority of mostly low-resource languages. What does it take to break the 200 language barrier while ensuring safe, high quality results, all while keeping ethical considerations in mind? In No Language Left Behind, we took on this challenge by first contextualizing the need for low-resource language translation support through exploratory interviews with native speakers. Then, we created datasets and models aimed at narrowing the performance gap between low and high-resource languages. More specifically, we developed a conditional compute model based on Sparsely Gated Mixture of Experts that is trained on data obtained with novel and effective data mining techniques tailored for low-resource languages. We propose multiple architectural and training improvements to counteract overfitting while training on thousands of tasks. Critically, we evaluated the performance of over 40,000 different translation directions using a human-translated benchmark, Flores-200, and combined human evaluation with a novel toxicity benchmark covering all languages in Flores-200 to assess translation safety. Our model achieves an improvement of 44% BLEU relative to the previous state-of-the-art, laying important groundwork towards realizing a universal translation system. Finally, we open source all contributions described in this work, accessible at https://github.com/facebookresearch/fairseq/tree/nllb.

Consulter le document

Lambrecht, L., Schneider, F., & Waibel, A. (2022). Machine Translation from Standard German to Alemannic Dialects. In M. Melero, S. Sakti, & C. Soria (Eds.), Proceedings of the 1st Annual Meeting of the ELRA/ISCA Special Interest Group on Under-Resourced Languages (pp. 129–136). European Language Resources Association. https://aclanthology.org/2022.sigul-1.17

Machine translation has been researched using deep neural networks in recent years. These networks require lots of data to learn abstract representations of the input stored in continuous vectors. Dialect translation has become more important since the advent of social media. In particular, when dialect speakers and standard language speakers no longer understand each other, machine translation is of rising concern. Usually, dialect translation is a typical low-resourced language setting facing data scarcity problems. Additionally, spelling inconsistencies due to varying pronunciations and the lack of spelling rules complicate translation. This paper presents the best-performing approaches to handle these problems for Alemannic dialects. The results show that back-translation and conditioning on dialectal manifestations achieve the most remarkable enhancement over the baseline. Using back-translation, a significant gain of +4.5 over the strong transformer baseline of 37.3 BLEU points is accomplished. Differentiating between several Alemannic dialects instead of treating Alemannic as one dialect leads to substantial improvements: Multi-dialectal translation surpasses the baseline on the dialectal test sets. However, training individual models outperforms the multi-dialectal approach. There, improvements range from 7.5 to 10.6 BLEU points over the baseline depending on the dialect.

Consulter le document

Khanna, T., Washington, J. N., Tyers, F. M., Bayatlı, S., Swanson, D. G., Pirinen, T. A., Tang, I., & Alòs i Font, H. (2021). Recent advances in Apertium, a free/open-source rule-based machine translation platform for low-resource languages. Machine Translation, 35(4), 475–502. https://doi.org/10.1007/s10590-021-09260-6

Consulter sur doi.org

Fan, A., Bhosale, S., Schwenk, H., Ma, Z., El-Kishky, A., Goyal, S., Baines, M., Celebi, O., Wenzek, G., Chaudhary, V., Goyal, N., Birch, T., Liptchinsky, V., Edunov, S., Grave, E., Auli, M., & Joulin, A. (2020). Beyond English-Centric Multilingual Machine Translation.

Forcada, M. L., Ginestí-Rosell, M., Nordfalk, J., O’Regan, J., Ortiz-Rojas, S., Pérez-Ortiz, J. A., Sánchez-Martínez, F., Ramírez-Sánchez, G., & Tyers, F. M. (2011). Apertium: a free/open-source platform for rule-based machine translation. Machine Translation: MT, 25(2), 127–144. https://idp.springer.com/authorize/casa?redirect_uri=https://link.springer.com/article/10.1007/s10590-011-9090-0&casa_token=Z45kBcdYw5IAAAAA:8r9hwCmpu2bYLIB5wzQU_o1dYR4BnknydRP2nCga4J822owjSjeQoCN3fh_7Q68ot_QxrI5jHSNvbxRq

Consulter sur idp.springer.com

Carme Armentano i Oller, & Forcada, M. L. (2006). Open source machine translation between small languages: Catalan and Aranese Occitan. Proceedings of the 5th Workshop on Strategies for Developing Machine Translation for Minority Languages, 51–54. https://aclanthology.org/www.mt-archive.info/LREC-2006-Armentano.pdf

We describe the use of an open-source shallow-transfer machine translation engine, Apertium, and existing open-source linguistic data to build a bidirectional machine translation system for a new pair of 'small' languages, Catalan (6 million speakers) and the Aranese variety (5000 speakers) of Occitan (about 1 million speakers), and discuss its possible uses and their effects on the linguistic normalization of the smaller language.

Consulter le document

Votre recherche

Résultats 12 ressources

Explorer

Corpus

Langue

Tâche