Votre recherche

Réinitialiser la recherche

Langue

Poitevin-Saintongeais

Résultats 3 ressources

Résumés

Bernhard, D., Vergez-Couret, M., & Dupuy, E. (2024). Au-delà des normes : identifier et documenter les langues minorisées pour le traitement automatique des langues. Cahiers Du Plurilinguisme Européen, 16. https://doi.org/10.57086/cpe.1710

Dieser Artikel stellt Überlegungen zu den Herausforderungen der Dokumentation von Minderheitensprachen im digitalen Raum an, ausgehend von den Arbeiten, die im Rahmen des DIVITAL-Projekts durchgeführt wurden. Die ersten Arbeiten des Projekts betrafen die Sammlung von Korpora und ihre Dokumentation durch feinkörnige Metadaten. Diese Arbeiten haben zwei große Herausforderungen aufgezeigt: (i) die Identifizierung der Sprachen und ihrer Varianten im Rahmen der Normen für die Kodierung von Sprachnamen und (ii) die Schaffung neuer Ressourcen in Verbindung mit der aktuellen Praxis dieser Sprachen.

Consulter le document
Garcia Holgado, C., & Vergez-Couret, M. (2024). Empowering Low-Resource Regional Languages with Lexicons : A Comparative Study of NLP Tools for Morphosyntactic Analysis. In N. Calzolari, M.-Y. Kan, V. Hoste, A. Lenci, S. Sakti, & N. Xue (Eds.), Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024) (pp. 5747–5756). ELRA and ICCL. https://aclanthology.org/2024.lrec-main.510

We investigate the effect of integrating lexicon information to an extremely low-resource language when annotated data is scarce for morpho-syntactic analysis. Obtaining such data and linguistic resources for these languages are usually constrained by a lack of human and financial resources making this task particularly challenging. In this paper, we describe the collection and leverage of a bilingual lexicon for Poitevin-Saintongeais, a regional language of France, to create augmented data through a neighbor-based distributional method. We assess this lexicon-driven approach in improving POS tagging while using different lexicon and augmented data sizes. To evaluate this strategy, we compare two distinct paradigms: neural networks, which typically require extensive data, and a conventional probabilistic approach, in which a lexicon is instrumental in its performance. Our findings reveal that the lexicon is a valuable asset for all models, but in particular for neural, demonstrating an enhanced generalization across diverse classes without requiring an extensive lexicon size.

Consulter le document
Stosic, D., Marjanović, S., Bernhard, D., Bras, M., Kevers, L., Retali-Medori, S., Vergez-Couret, M., & Werner, C. (2024). The ParCoLab Parallel Corpus and Its Extension to Four Regional Languages of France. In N. Calzolari, M.-Y. Kan, V. Hoste, A. Lenci, S. Sakti, & N. Xue (Eds.), Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024) (pp. 16014–16023). ELRA and ICCL. https://aclanthology.org/2024.lrec-main.1392

Parallel corpora are still scarce for most of the world's language pairs. The situation is by no means different for regional languages of France. In addition, adequate web interfaces facilitate and encourage the use of parallel corpora by target users, such as language learners and teachers, as well as linguists. In this paper, we describe ParCoLab, a parallel corpus and a web platform for querying the corpus. From its onset, ParCoLab has been geared towards lower-resource languages, with an initial corpus in Serbian, along with French and English (later Spanish). We focus here on the extension of ParCoLab with a parallel corpus for four regional languages of France: Alsatian, Corsican, Occitan and Poitevin-Saintongeais. In particular, we detail criteria for choosing texts and issues related to their collection. The new parallel corpus contains more than 20k tokens per regional language.

Consulter le document

Flux web personnalisé

Dernière mise à jour depuis la base de données : 23/06/2025 15:08 (UTC)

Votre recherche

Résultats 3 ressources

Explorer

Corpus

Langue

Tâche

Type de papier