Do not neglect related languages: The case of low-resource Occitan cross-lingual word embeddings
Resource Type
Conference Paper
Authors/Contributors
- Woller, Lisa (Author)
- Hangya, Viktor (Author)
- Fraser, Alexander (Author)
- Ataman, Duygu (Editor)
- Birch, Alexandra (Editor)
- Conneau, Alexis (Editor)
- Firat, Orhan (Editor)
- Ruder, Sebastian (Editor)
- Sahin, Gozde Gul (Editor)
Title
Do not neglect related languages: The case of low-resource Occitan cross-lingual word embeddings
Abstract
Cross-lingual word embeddings (CLWEs) have proven indispensable for various natural language processing tasks, e.g., bilingual lexicon induction (BLI). However, the lack of data often impairs the quality of representations. Various approaches requiring only weak cross-lingual supervision were proposed, but current methods still fail to learn good CLWEs for languages with only a small monolingual corpus. We therefore claim that it is necessary to explore further datasets to improve CLWEs in low-resource setups. In this paper we propose to incorporate data of related high-resource languages. In contrast to previous approaches which leverage independently pre-trained embeddings of languages, we (i) train CLWEs for the low-resource and a related language jointly and (ii) map them to the target language to build the final multilingual space. In our experiments we focus on Occitan, a low-resource Romance language which is often neglected due to lack of resources. We leverage data from French, Spanish and Catalan for training and evaluate on the Occitan-English BLI task. By incorporating supporting languages our method outperforms previous approaches by a large margin. Furthermore, our analysis shows that the degree of relatedness between an incorporated language and the low-resource language is critically important.
Date
2021-11
Proceedings Title
Proceedings of the 1st Workshop on Multilingual Representation Learning
Conference Name
MRL 2021
Place
Punta Cana, Dominican Republic
Publisher
Association for Computational Linguistics
Pages
41–50
Short Title
Do not neglect related languages
Accessed
13/05/2024 09:04
Library Catalog
ACLWeb
Reference
Woller, L., Hangya, V., & Fraser, A. (2021). Do not neglect related languages: The case of low-resource Occitan cross-lingual word embeddings. In D. Ataman, A. Birch, A. Conneau, O. Firat, S. Ruder, & G. G. Sahin (Eds.), Proceedings of the 1st Workshop on Multilingual Representation Learning (pp. 41–50). Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.mrl-1.4