Votre recherche

Réinitialiser la recherche

Corpus

Texte
- Annotated

Langue

Breton

Résultats 2 ressources

Résumés

Grobol, L., & Jouitteau, M. (2024, May). ARBRES Kenstur: a Breton-French Parallel Corpus Rooted in Field Linguistics. LREC-COLING 2024 - The 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation. https://hal.science/hal-04551941

ARBRES is an ongoing project of open science implemented as a platform (“wikigrammar”) documenting both the Breton language itself and the state of research and engineering work in linguistics and NLP. Along its nearly 15 years of operation, it has aggregated a wealth of linguistic data in the form of interlinear glosses with translations illustrating lexical items, grammatical features, dialectal variations… While these glosses were primarily meant for human consumption, their volume and the regular format imposed by the wiki engine used for the website also make them suitable for machine processing. ARBRES Kenstur is a new parallel corpus derived from the glosses in ARBRES, including about 5k phrases and sentences in Breton along with translations in standard French. The nature of the original data — sourced from field linguistic inquiries meant to document the structure of Breton — leads to a resource that is mechanically more concerned with the internal variations of the language and rare phenomena than typical parallel corpora. Preliminaries experiments in using this corpus show that it can help improve machine translation for Breton, demonstrating that sourcing data from field linguistic documentation can be a way to help provide NLP tools for minority and low-resource languages.

Consulter le document
Jouitteau, M. (2023). Community Internally-driven Corpus Buildings. Three Examples from the Breton Ecosystem. 2nd Annual Meeting of the ELRA/ISCA SIG on Under-Resourced Languages (SIGUL 2023), 103–107. https://doi.org/10.21437/SIGUL.2023-22

This paper is a position paper concerning corpus-building strategies in minoritized languages in the Global North. It draws attention to the structure of the non-technical community of speakers, and concretely addresses how their needs can inform the design of technical solutions. Celtic Breton is taken as a case study for its relatively small speaker community, which is rather well-connected to modern technical infrastructures, and is bilingual with a non-English language (French). I report on three different community internal initiatives that have the potential to facilitate the growth of NLP-ready corpora in FAIR practices (Findability, Accessibility, Interoperability, Reusability). These initiatives follow a careful analysis of the Breton NLP situation both inside and outside of academia, and take advantage of preexisting dynamics. They are integrated to the speaking community, both on small and larger scales. They have in common the goal of creating an environment that fosters virtuous circles, in which various actors help each other. It is the interactions between these actors that create qualityenriched corpora usable for NLP, once some low-cost technical solutions are provided. This work aims at providing an estimate of the community’s internal potential to grow its own pool of resources, provided the right NLP resource gathering tools and ecosystem design. Some projects reported here are in the early stages of conception, while others build on decade-long society/research interfaces for the building of resources. All call for feedback from both NLP researchers and the speaking communities, contributing to building bridges and fruitful collaborations between these two groups.

Consulter le document

Flux web personnalisé

Dernière mise à jour depuis la base de données : 23/06/2025 15:08 (UTC)

Votre recherche

Résultats 2 ressources

Explorer

Corpus

Langue

Tâche

Type de papier