Tag your Occitan text

You can find more information on this specific model here: https://github.com/DEFI-COLaF/modeles-papie

Cite with the following

Please remember that corpus creation and software engineering are valid research, so please cite these resources when you use this lemmatizer in your research. This includes the wonderful original research by E. Manjavacas, M. Kestemont and Á. Kádár, as well as the software wrapper built to handle pre- and post-processing.

For each model, a bibliography is given, along with other citable works such as models and datasets.

@software{thibault_clerice_2020_3883590,
  author       = {Clérice, Thibault},
  title        = {Pie Extended, an extension for Pie with pre-processing and post-processing},
  month        = jun,
  year         = 2020,
  publisher    = {Zenodo},
  doi          = {10.5281/zenodo.3883589},
  url          = {https://doi.org/10.5281/zenodo.3883589}
}
@inproceedings{manjavacas-etal-2019-improving,
    title = "Improving Lemmatization of Non-Standard Languages with Joint Learning",
    author = "Manjavacas, Enrique  and
      K{\'a}d{\'a}r, {\'A}kos  and
      Kestemont, Mike",
    booktitle = "Proceedings of the 2019 Conference of the North {A}merican Chapter of the Association for Computational
      Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)",
    month = jun,
    year = "2019",
    address = "Minneapolis, Minnesota",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/N19-1153",
    doi = "10.18653/v1/N19-1153",
    pages = "1493--1503",}
@software{nedey_2024,
  author    = {Nédey, Oriane and Janès, Juliette and Sagot, Benoît and Bawden, Rachel and Clérice, Thibault},
  title     = {Modèle Occitan (0.0.1)},
  month = may,
  year = 2024,
  publisher={COLaF},
  version={v0.0.1},
  url={https://github.com/DEFI-COLaF/modeles-papie}
}

@inproceedings{miletic:hal-02123743,
  TITLE = {{Transformation d'annotations en parties du discours et lemmes vers le format Universal Dependencies : {\'e}tude de cas pour l'alsacien et l'occitan}},
  AUTHOR = {Miletic, Aleksandra and Bernhard, Delphine and Bras, Myriam and Ligozat, Anne-Laure and Vergez-Couret, Marianne},
  URL = {https://hal.science/hal-02123743},
  BOOKTITLE = {{26e conf{\'e}rence sur le Traitement Automatique des Langues Naturelles (TALN-2019) et 21e {\'e}dition la conf{\'e}rence jeunes chercheur$\times$euse$\times$s RECITAL}},
  ADDRESS = {Toulouse, France},
  PUBLISHER = {{ATALA}},
  SERIES = {Actes de la Conf{\'e}rence sur le Traitement Automatique des Langues Naturelles},
  VOLUME = {2},
  PAGES = {427-435},
  YEAR = {2019},
  MONTH = Jul,
}

Information about the model

This model provides lemmatization and part-of-speech tagging for Occitan texts. It was trained on two datasets available for Occitan dialects:

  • Tolosa Treebank v2 (TTB), a manually annotated corpus (POS+lemma) of 26K tokens, distinguishing between four Occitan varieties (Languedocian, Gascon, Limousin, Provençal) and already split into training, development and test sets.
  • OcWikiAnnot (WIKI), a fully automatically annotated corpus (POS+lemma) of 2M tokens, based on an extraction of pages from Wikipedia in Occitan.

POS tagging

| Model info | Finetuned from | Training data | Model + training hyperparameters | Accuracy (TTB test set, all dialects) |
|---|---|---|---|---|
| MaChAmp by (Hopton and Aepli, 2024) | mBERT finetuned on Occitan | TTB | | 94.10 |
| PaPie_POS_WIKITTB | scratch | WIKI + TTB | 18 epochs; 1 layer for SentRNN, CharRNN, AttentionalDecoder; embeddings + hidden size 128; include PaPie LM during training | 93.58 |
| MaChAmp by (Miletic, 2023) | mBERT | TTB | | 92.26 |

Lemmatization

| Model info | Finetuned from | Training data | Model + training hyperparameters | Accuracy (TTB test set, all dialects) |
|---|---|---|---|---|
| Stanza by (Miletic, 2023) | FastText | TTB | input token+POS | 93.21 |
| PaPie_Lemma_finetune-WIKI2TTB | WIKI | TTB | vocabulary expanded with new words/chars/lemmas/labels from TTB: 592 chars / 24000 words / 798 lemmas; 39 epochs | 92.89 |
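
The page does not say how the reported accuracy is computed. The sketch below assumes plain token-level accuracy (the share of tokens whose predicted label matches the gold label), reported overall and per dialect. The record layout `(dialect, gold, predicted)` and the function name `accuracy_by_dialect` are hypothetical, chosen for illustration only; they are not the Tolosa Treebank format or part of PaPie.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Hypothetical record layout: (dialect, gold_label, predicted_label).
# This is an assumption for illustration, not the Tolosa Treebank file format.
Record = Tuple[str, str, str]


def accuracy_by_dialect(records: Iterable[Record]) -> Dict[str, float]:
    """Return token-level accuracy in percent, per dialect and under the key "all"."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for dialect, gold, predicted in records:
        # Count every token once for its dialect and once for the overall score.
        for key in (dialect, "all"):
            total[key] += 1
            correct[key] += int(gold == predicted)
    return {key: 100.0 * correct[key] / total[key] for key in total}


if __name__ == "__main__":
    # Toy data, invented for the example; these are not results from the model.
    toy = [
        ("gascon", "NOUN", "NOUN"),
        ("gascon", "VERB", "NOUN"),
        ("limousin", "DET", "DET"),
        ("limousin", "ADJ", "ADJ"),
    ]
    for name, score in sorted(accuracy_by_dialect(toy).items()):
        print(f"{name}: {score:.2f}")
```

The same function works for lemmatization if the gold and predicted lemmas are passed in place of the POS labels.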

Bibliography

This lemmatizer is provided to you thanks to the data of the LASLA, the software of Enrique Manjavacas and Mike Kestemont, and some engineering from the École nationale des chartes. If you want to cite them:

  • E. Manjavacas & Á. Kádár & M. Kestemont, « Improving Lemmatization of Non-Standard Languages with Joint Learning », Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 2019, pp. 1493--1503.
  • Enrique Manjavacas & Mike Kestemont. (2019, January 17). emanjavacas/pie v0.1.3 (Version v0.1.3). Zenodo. http://doi.org/10.5281/zenodo.2542537
  • Miletic Haddad, A. 2023. Outiller l'occitan : nouvelles ressources et lemmatisation. In C. Servan & A. Vilnat (eds), Actes de la 30e Conférence sur le Traitement Automatique des Langues Naturelles (TALN) : Volume 1: travaux de recherche originaux - articles longs. Association pour le Traitement Automatique des Langues, Paris, pp. 217-231. Conférence sur le Traitement Automatique des Langues Naturelles, Paris, France, 05/06/2023.
  • Aleksandra Miletic, Delphine Bernhard, Myriam Bras, Anne-Laure Ligozat, Marianne Vergez-Couret. Transformation d’annotations en parties du discours et lemmes vers le format Universal Dependencies : étude de cas pour l’alsacien et l’occitan. 26e conférence sur le Traitement Automatique des Langues Naturelles (TALN-2019) et 21e édition la conférence jeunes chercheur·euse·s RECITAL, Jul 2019, Toulouse, France. pp.427-435. ⟨hal-02123743⟩