Votre recherche
Résultats 4 ressources
-
We investigate the effect of integrating lexicon information to an extremely low-resource language when annotated data is scarce for morpho-syntactic analysis. Obtaining such data and linguistic resources for these languages are usually constrained by a lack of human and financial resources making this task particularly challenging. In this paper, we describe the collection and leverage of a bilingual lexicon for Poitevin-Saintongeais, a regional language of France, to create augmented data through a neighbor-based distributional method. We assess this lexicon-driven approach in improving POS tagging while using different lexicon and augmented data sizes. To evaluate this strategy, we compare two distinct paradigms: neural networks, which typically require extensive data, and a conventional probabilistic approach, in which a lexicon is instrumental in its performance. Our findings reveal that the lexicon is a valuable asset for all models, but in particular for neural, demonstrating an enhanced generalization across diverse classes without requiring an extensive lexicon size.
-
This paper presents a first attempt to apply Universal Dependencies (De Marneffe et al., 2021) to train a parser for Mauritian Creole (MC), a French-based Creole language spoken on the island of Mauritius. This paper demonstrates the construction of a 161-sentence (1007-token) treebank for MC and evaluates the performance of a part-of-speech tagger and Universal Dependencies parser trained on this data. The sentences were collected from publicly available grammar books (Syea, 2013) and online resources (Baker and Kriegel, 2013), as well as from government-produced school textbooks (Antonio-Françoise et al., 2021; Natchoo et al., 2017). The parser, trained with UDPipe 2 (Straka, 2018), reached F1 scores of UPOS=86.2, UAS=80.8 and LAS=69.8. This fares favorably when compared to models of similar size for other under-resourced Indigenous and Creole languages. We then address some of the challenges faced when applying UD to Creole languages in general and to Mauritian Creole in particular. The main challenge was the handling of spelling variation in the input. Other issues include the tagging of modal verbs, middle voice sentences, and parts of the tense-aspect-mood system (such as the particle fek).
-
In this paper we present a series of experiments towards POS tagging Corsican, a less-resourced language spoken in Corsica and linguistically related to Italian. The first contribution is Corsican-POS, the first gold standard POS-tagged corpus for Corsica, composed of 500 sentences manually annotated with the Universal POS tagset. Our second contribution is a set of experiments and evaluation of POS tagging models which starts with a baseline model for Italian and is aimed at finding the best training configuration, namely in terms of the size and combination strategy of the existing raw and annotated resources. These experiments result in (i) the first POS tagger for Corsican, reaching an accuracy of 93.38%, (ii) a quantification of the gain provided by the use of each available resource. We find that the optimal configuration uses Italian word embeddings further specialized with Corsican embeddings and trained on the largest gold corpus for Corsican available so far.
-
Despite the success of the Universal Dependencies (UD) project exemplified by its impressive language breadth, there is still a lack in `within-language breadth': most treebanks focus on standard languages. Even for German, the language with the most annotations in UD, so far no treebank exists for one of its language varieties spoken by over 10M people: Bavarian. To contribute to closing this gap, we present the first multi-dialect Bavarian treebank (MaiBaam) manually annotated with part-of-speech and syntactic dependency information in UD, covering multiple text genres (wiki, fiction, grammar examples, social, non-fiction). We highlight the morphosyntactic differences between the closely-related Bavarian and German and showcase the rich variability of speakers' orthographies. Our corpus includes 15k tokens, covering dialects from all Bavarian-speaking areas spanning three countries. We provide baseline parsing and POS tagging results, which are lower than results obtained on German and vary substantially between different graph-based parsers. To support further research on Bavarian syntax, we make our dataset, language-specific guidelines and code publicly available.
Explorer
Langue
- Corse (1)
- Créoles (1)
- Poitevin-Saintongeais (1)