Your search

Corpus

Text
- Annotated
  - Parallel

Results: 2 resources

Abstracts

Grobol, L., & Jouitteau, M. (2024, May). ARBRES Kenstur: a Breton-French Parallel Corpus Rooted in Field Linguistics. LREC-COLING 2024 - The 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation. https://hal.science/hal-04551941

ARBRES is an ongoing project of open science implemented as a platform (“wikigrammar”) documenting both the Breton language itself and the state of research and engineering work in linguistics and NLP. Over its nearly 15 years of operation, it has aggregated a wealth of linguistic data in the form of interlinear glosses with translations illustrating lexical items, grammatical features, dialectal variations… While these glosses were primarily meant for human consumption, their volume and the regular format imposed by the wiki engine used for the website also make them suitable for machine processing. ARBRES Kenstur is a new parallel corpus derived from the glosses in ARBRES, including about 5k phrases and sentences in Breton along with translations in standard French. The nature of the original data, sourced from field linguistic inquiries meant to document the structure of Breton, leads to a resource that is mechanically more concerned with the internal variations of the language and rare phenomena than typical parallel corpora. Preliminary experiments in using this corpus show that it can help improve machine translation for Breton, demonstrating that sourcing data from field linguistic documentation can be a way to help provide NLP tools for minority and low-resource languages.

Stosic, D., Marjanović, S., Bernhard, D., Bras, M., Kevers, L., Retali-Medori, S., Vergez-Couret, M., & Werner, C. (2024). The ParCoLab Parallel Corpus and Its Extension to Four Regional Languages of France. In N. Calzolari, M.-Y. Kan, V. Hoste, A. Lenci, S. Sakti, & N. Xue (Eds.), Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024) (pp. 16014–16023). ELRA and ICCL. https://aclanthology.org/2024.lrec-main.1392

Parallel corpora are still scarce for most of the world's language pairs. The situation is by no means different for regional languages of France. In addition, adequate web interfaces facilitate and encourage the use of parallel corpora by target users, such as language learners and teachers, as well as linguists. In this paper, we describe ParCoLab, a parallel corpus and a web platform for querying the corpus. From its outset, ParCoLab has been geared towards lower-resource languages, with an initial corpus in Serbian, along with French and English (later Spanish). We focus here on the extension of ParCoLab with a parallel corpus for four regional languages of France: Alsatian, Corsican, Occitan and Poitevin-Saintongeais. In particular, we detail criteria for choosing texts and issues related to their collection. The new parallel corpus contains more than 20k tokens per regional language.

Last updated from the database: 23/06/2025 15:08 (UTC)
