Lemmatization Experiments on Two Low-Resourced Languages: Low Saxon and Occitan

Resource Type
Conference Paper
Authors/contributors
Title
Lemmatization Experiments on Two Low-Resourced Languages: Low Saxon and Occitan
Abstract
We present lemmatization experiments on the unstandardized low-resourced languages Low Saxon and Occitan using two machine-learning-based approaches represented by MaChAmp and Stanza. We show different ways to increase training data by leveraging historical corpora, small amounts of gold data and dictionary information, and discuss the usefulness of this additional data. In the results, we find some differences in the performance of the models depending on the language. This variation is likely to be partly due to differences in the corpora we used, such as the amount of internal variation. However, we also observe common tendencies, for instance that sequential models trained only on gold-annotated data often yield the best overall performance and generalize better to unknown tokens.
Date
2023
Proceedings Title
Tenth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2023)
Conference Name
Tenth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2023)
Place
Dubrovnik, Croatia
Publisher
Association for Computational Linguistics
Pages
163-173
Language
en
Short Title
Lemmatization Experiments on Two Low-Resourced Languages
Accessed
13/05/2024 09:03
Library Catalog
DOI.org (Crossref)
Reference
Miletić, A., & Siewert, J. (2023). Lemmatization Experiments on Two Low-Resourced Languages: Low Saxon and Occitan. Tenth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2023), 163–173. https://doi.org/10.18653/v1/2023.vardial-1.17
Language