Tagging Occitan using French and Castillan Tree Tagger
Resource type
Conference Paper
Author/contributor
- Vergez-Couret, Marianne (Author)
Title
Tagging Occitan using French and Castillan Tree Tagger
Abstract
Part-of-speech (POS) tagging, including tokenization and sentence splitting, is the first step in any Natural Language Processing chain. It usually requires substantial effort to annotate corpora and produce lexicons. However, when these language resources are missing, as in Occitan, methods have been developed to adapt the taggers of existing resource-rich languages rather than concentrate effort on creating the resources. These methods exploit the etymological proximity between the under-resourced language and a resource-rich language. In this article, we focus on Occitan, which shares similarities with several Romance languages, including French and Castilian. The method consists in running an existing morpho-syntactic tool, here TreeTagger, on Occitan texts after first translating the frequent words into a resource-rich language. We performed two distinct experiments, one exploiting similarities between Occitan and French and the other exploiting similarities between Occitan and Castilian. This method only requires listing the 300 most frequent words (based on a corpus) to construct two bilingual lexicons (Occitan/French and Occitan/Castilian). Our results are better than those obtained with the Apertium tagger, which uses a larger lexicon.
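The pre-translation step described in the abstract can be illustrated with a minimal Python sketch. It is not the author's implementation: the function names, the toy Occitan/French lexicon, and the lowercasing rule are all assumptions made for illustration, and the call to the tagger itself is left out.

```python
from collections import Counter


def most_frequent_words(tokens, n=300):
    """Return the n most frequent lowercased tokens of a corpus."""
    counts = Counter(t.lower() for t in tokens)
    return [word for word, _ in counts.most_common(n)]


def substitute(tokens, lexicon):
    """Replace tokens found in the bilingual lexicon; keep the others as-is."""
    return [lexicon.get(t.lower(), t) for t in tokens]


# Hypothetical toy lexicon (Occitan -> French); a real one would cover
# the 300 most frequent words of an Occitan corpus.
OC_FR = {"lo": "le", "e": "et", "es": "est", "dins": "dans"}

occitan = "Lo gat es dins l'ostal".split()
pretranslated = substitute(occitan, OC_FR)
print(pretranslated)
```

Because the substitution is word-for-word, the output keeps one token per input token, so tags produced by a French (or Castilian) tagger on the pre-translated text could be mapped back to the original Occitan tokens by position.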
Date
2013
Proceedings Title
Proceedings of the 3rd LTC Workshop on Less Resourced Languages, new technologies, new challenges and opportunities
Place
Poznan, Poland
Accessed
02/08/2024 14:14
Library Catalog
HAL Archives Ouvertes
Reference
Vergez-Couret, M. (2013). Tagging Occitan using French and Castillan Tree Tagger. Proceedings of the 3rd LTC Workshop on Less Resourced Languages, New Technologies, New Challenges and Opportunities. https://hal.science/hal-00986426