Bibliographie complète | Bibliographie COLaF

Poujade, C., Fort, K., Gardent, C., & Parmentier, Y. (2023, November). Prise en compte de la variation dans l’annotation automatique morphosyntaxique de l’occitan. https://hal.science/hal-04622672

Occitan is a Romance language of France, a little part of Italy and Spain. It includes many written variations, dialectal and spelling variations. Being able to take variation into account is a major challenge to provide the language. Automatic processing of Occitan has been developing over the last ten years. Resources and tools have been developed and are beginning to take dialectal variation into account in these works. However, graphical variation is rarely taken into account. Our research focuses on the automatic annotation into lemmas, parts of speech and verbal inflection of a corpus of texts containing these two types of variation. From this corpus we train robust automatic annotation tools on global variation in Occitan.

Consulter le document

Soto, W., Parmentier, Y., & Gardent, C. (2023). Phylogeny-Inspired Soft Prompts For Data-to-Text Generation in Low-Resource Languages. In Y. Arase, B. Hu, & W. Lu (Eds.), IJCNLP-AACL 2023: The 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics. ACL. https://hal.science/hal-04199557

Most work on verbalising Knowledge-Graphs (KG) has focused on high-resource languages such as English, Russian, Czech or Arabic. In this paper, we focus on KG-to-Text generation where the output text is in Breton, Irish or Welsh. To overcome the small size of the parallel training data, we combine the strengths of a multilingual encoder-decoder model with denoising fine-tuning on monolingual data and Soft Prompt fine-tuning on a small quantity of KG/text data. We furthermore structure the soft prompt into multiple sub-prompts designed to capture the similarities and differences between English, Knowledge graphs and the three target languages. Our experiments show that our approach outperforms strong baselines and that all sub-prompts contribute to performance.

Consulter le document

TEI Consortium. (2023). TEI P5: Guidelines for Electronic Text Encoding and Interchange Version 4.7.0 [Standard].

Jouitteau, M. (2023). Community Internally-driven Corpus Buildings. Three Examples from the Breton Ecosystem. 2nd Annual Meeting of the ELRA/ISCA SIG on Under-Resourced Languages (SIGUL 2023), 103–107. https://doi.org/10.21437/SIGUL.2023-22

This paper is a position paper concerning corpus-building strategies in minoritized languages in the Global North. It draws attention to the structure of the non-technical community of speakers, and concretely addresses how their needs can inform the design of technical solutions. Celtic Breton is taken as a case study for its relatively small speaker community, which is rather well-connected to modern technical infrastructures, and is bilingual with a non-English language (French). I report on three different community internal initiatives that have the potential to facilitate the growth of NLP-ready corpora in FAIR practices (Findability, Accessibility, Interoperability, Reusability). These initiatives follow a careful analysis of the Breton NLP situation both inside and outside of academia, and take advantage of preexisting dynamics. They are integrated to the speaking community, both on small and larger scales. They have in common the goal of creating an environment that fosters virtuous circles, in which various actors help each other. It is the interactions between these actors that create qualityenriched corpora usable for NLP, once some low-cost technical solutions are provided. This work aims at providing an estimate of the community’s internal potential to grow its own pool of resources, provided the right NLP resource gathering tools and ecosystem design. Some projects reported here are in the early stages of conception, while others build on decade-long society/research interfaces for the building of resources. All call for feedback from both NLP researchers and the speaking communities, contributing to building bridges and fruitful collaborations between these two groups.

Consulter le document

apertium/apertium-oci-fra. (2023). Apertium. https://github.com/apertium/apertium-oci-fra (Original work published 2018)

Apertium translation pair for Occitan and French

Consulter sur github.com

Scherrer, Y., Kuparinen, O., & Miletic, A. (2023). CorCoDial – Machine translation techniques for corpus-based computational dialectology. Proceedings of the 24th Annual Conference of the European Association for Machine Translation, 511–512. https://events.tuni.fi/uploads/2023/06/11678752-proceedings-eamt2023.pdf

Consulter le document

Icard, B., Claveau, V., Atemezing, G., & Egré, P. (2023). Un traitement hybride du vague textuel : du système expert VAGO à son clone neuronal. In C. Servan & A. Vilnat (Eds.), Actes de CORIA-TALN 2023. Actes de la 30e Conférence sur le Traitement Automatique des Langues Naturelles (TALN), volume 1 : travaux de recherche originaux – articles longs (pp. 151–163). ATALA. https://aclanthology.org/2023.jeptalnrecital-long.12

L'outil VAGO est un système expert de détection du vague lexical qui mesure aussi le degré de subjectivité du discours, ainsi que son niveau de détail. Dans cet article, nous construisons un clone neuronal de VAGO, fondé sur une architecture de type BERT, entraîné à partir des scores du VAGO symbolique sur un corpus de presse française (FreSaDa). L'analyse qualitative et quantitative montre la fidélité de la version neuronale. En exploitant des outils d'explicabilité (LIME), nous montrons ensuite l'intérêt de cette version neuronale d'une part pour l'enrichissement des lexiques de la version symbolique, et d'autre part pour la production de versions dans d'autres langues.

Consulter sur aclanthology.org

Miletić, A. (2023). Outiller l’occitan : nouvelles ressources et lemmatisation. In C. Servan & A. Vilnat (Eds.), Actes de CORIA-TALN 2023. Actes de la 30e Conférence sur le Traitement Automatique des Langues Naturelles (TALN), volume 1 : travaux de recherche originaux – articles longs (pp. 217–231). ATALA. https://aclanthology.org/2023.jeptalnrecital-long.17

Ce travail présente des contributions récentes à l'effort de doter l'occitan de ressources et outils pour le TAL. Plusieurs ressources existantes ont été modifiées ou adaptées, notamment un tokéniseur à base de règles, un lexique morphosyntaxique et un corpus arboré. Ces ressources ont été utilisées pour entraîner et évaluer des modèles neuronaux pour la lemmatisation. Dans le cadre de ces expériences, un nouveau corpus plus large (2 millions de tokens) provenant du Wikipédia a été annoté en parties du discours, lemmatisé et diffusé.

Consulter le document

apertium/apertium-br-fr. (2023). Apertium. https://github.com/apertium/apertium-br-fr (Original work published 2018)

Apertium translation pair for Breton and French

Consulter sur github.com

Blaschke, V., Schütze, H., & Plank, B. (2023). Does Manipulating Tokenization Aid Cross-Lingual Transfer? A Study on POS Tagging for Non-Standardized Languages (No. arXiv:2304.10158). arXiv. https://doi.org/10.48550/arXiv.2304.10158

One of the challenges with finetuning pretrained language models (PLMs) is that their tokenizer is optimized for the language(s) it was pretrained on, but brittle when it comes to previously unseen variations in the data. This can for instance be observed when finetuning PLMs on one language and evaluating them on data in a closely related language variety with no standardized orthography. Despite the high linguistic similarity, tokenization no longer corresponds to meaningful representations of the target data, leading to low performance in, e.g., part-of-speech tagging. In this work, we finetune PLMs on seven languages from three different families and analyze their zero-shot performance on closely related, non-standardized varieties. We consider different measures for the divergence in the tokenization of the source and target data, and the way they can be adjusted by manipulating the tokenization during the finetuning step. Overall, we find that the similarity between the percentage of words that get split into subwords in the source and target data (the split word ratio difference) is the strongest predictor for model performance on target data.

Consulter le document

Adda, G., Vasilescu, I., & Yvon, F. (2023). Language Report French. In G. Rehm & A. Way (Eds.), European Language Equality: A Strategic Agenda for Digital Language Equality (pp. 139–142). Springer International Publishing. https://doi.org/10.1007/978-3-031-28819-7_16

This chapter presents a survey of the current state of technologies for the automatic processing of the French language. It is based on a thorough analysis of existing tools and resources for French, and also provides an accurate presentation of the domain and its main stakeholders (Adda et al. 2022). The chapter documents the presence of French on the internet and describes in broad terms the existing technologies for the French language. It also spells out general conclusions and formulates recommendations for progress towards deep language understanding for French.

Consulter sur doi.org

Bhatia, A., Sinha, S., Dingliwal, S., Gopalakrishnan, K., Bodapati, S., & Kirchhoff, K. (2023). Don’t Stop Self-Supervision: Accent Adaptation of Speech Representations via Residual Adapters.

Fabo, P. R. (2023). The MeThAL Alsatian theater corpus and related resources: Work done and perspectives. Actes Des Journées LIFT 2023.

Jouitteau, M., & Bideault, R. (2023). Outils numériques et traitement automatique du breton. In A. Rialland & M. Russo (Eds.), Langues régionales de France: nouvelles approches, nouvelles méthodologies, revitalisation (pp. 37–74). Société Linguistique de Paris. https://hal.science/hal-03918268

Consulter sur hal.science

Miletić, A., & Siewert, J. (2023). Lemmatization Experiments on Two Low-Resourced Languages: Low Saxon and Occitan. Tenth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2023), 163–173. https://doi.org/10.18653/v1/2023.vardial-1.17

We present lemmatization experiments on the unstandardized low-resourced languages Low Saxon and Occitan using two machine-learningbased approaches represented by MaChAmp and Stanza. We show different ways to increase training data by leveraging historical corpora, small amounts of gold data and dictionary information, and discuss the usefulness of this additional data. In the results, we find some differences in the performance of the models depending on the language. This variation is likely to be partly due to differences in the corpora we used, such as the amount of internal variation. However, we also observe common tendencies, for instance that sequential models trained only on gold-annotated data often yield the best overall performance and generalize better to unknown tokens.

Consulter le document

Radford, A., Kim, J. W., Xu, T., Brockman, G., McLeavey, C., & Sutskever, I. (2023). Robust speech recognition via large-scale weak supervision. Proceedings of the 40th International Conference on Machine Learning.

We study the capabilities of speech processing systems trained simply to predict large amounts of transcripts of audio on the internet. When scaled to 680,000 hours of multilingual and multitask supervision, the resulting models generalize well to standard benchmarks and are often competitive with prior fully supervised results without the need for any dataset specific fine-tuning. When compared to humans, the models approach their accuracy and robustness. We are releasing models and inference code to serve as a foundation for further work on robust speech processing.

Wion, A. (2023). Cocoon: A Platform for Documenting the World’s Languages. https://shs.hal.science/halshs-04161812

Consulter sur shs.hal.science

Feltin-Palas, M. (2022). Sauvons les langues régionales ! Éditions Héliopoles. https://www.heliopoles.fr/produit/54/9782379850769/sauvons-les-langues-regionales

Dans l’esprit d’une majorité de Français, les langues dites régionales ne seraient que des « patois », de vulgaires déformations du français, de vagues idiomes tout juste bons à décrire des banalités. Pourquoi devraient-ils s’émouvoir de leur effacement ? Or, tous les linguistes le savent : le basque, le breton, l’alsacien, le corse, le picard et les autres, n’ont rien à envier au français, à l’anglais, à l’arabe ou au mandarin. La seule différence entre les « petites langues » et les autres, c’est que les premières n’ont pas eu la chance de devenir des langues officielles d’un État. Cet ouvrage affiche une ambition assumée : réconcilier la France avec sa diversité. Pour que le français reste notre langue commune, sans devenir notre langue unique.

Consulter sur www.heliopoles.fr

Miletic, A., & Scherrer, Y. (2022). OcWikiDisc: a Corpus of Wikipedia Talk Pages in Occitan. In Y. Scherrer, T. Jauhiainen, N. Ljubešić, P. Nakov, J. Tiedemann, & M. Zampieri (Eds.), Proceedings of the Ninth Workshop on NLP for Similar Languages, Varieties and Dialects (pp. 70–79). Association for Computational Linguistics. https://aclanthology.org/2022.vardial-1.8

This paper presents OcWikiDisc, a new freely available corpus in Occitan, as well as language identification experiments on Occitan done as part of the corpus building process. Occitan is a regional language spoken mainly in the south of France and in parts of Spain and Italy. It exhibits rich diatopic variation, it is not standardized, and it is still low-resourced, especially when it comes to large downloadable corpora. We introduce OcWikiDisc, a corpus extracted from the talk pages associated with the Occitan Wikipedia. The version of the corpus with the most restrictive language filtering contains 8K user messages for a total of 618K tokens. The language filtering is performed based on language identification experiments with five off-the-shelf tools, including the new fasttext's language identification model from Meta AI's No Language Left Behind initiative, released in July 2022.

Consulter sur aclanthology.org

Mompelat, L., Dakota, D., & Kübler, S. (2022). How to Parse a Creole: When Martinican Creole Meets French. In N. Calzolari, C.-R. Huang, H. Kim, J. Pustejovsky, L. Wanner, K.-S. Choi, P.-M. Ryu, H.-H. Chen, L. Donatelli, H. Ji, S. Kurohashi, P. Paggio, N. Xue, S. Kim, Y. Hahm, Z. He, T. K. Lee, E. Santus, F. Bond, & S.-H. Na (Eds.), Proceedings of the 29th International Conference on Computational Linguistics (pp. 4397–4406). International Committee on Computational Linguistics. https://aclanthology.org/2022.coling-1.387/

We investigate methods to develop a parser for Martinican Creole, a highly under-resourced language, using a French treebank. We compare transfer learning and multi-task learning models and examine different input features and strategies to handle the massive size imbalance between the treebanks. Surprisingly, we find that a simple concatenated (French + Martinican Creole) baseline yields optimal results even though it has access to only 80 Martinican Creole sentences. POS embeddings work better than lexical ones, but they suffer from negative transfer.

Consulter sur aclanthology.org

Rechercher

Bibliographie complète 166 ressources

Explorer

Corpus

Langue

Tâche

Type de papier