Your search

In authors or contributors

"Kevers, Laurent"

Results: 4 resources

Kevers, L. (2021). L’identification de langue, un outil au service du corse et de l’évaluation des ressources linguistiques [Language identification, a tool for Corsican and for the evaluation of linguistic resources]. Traitement Automatique des Langues, 62(3), 13–37. https://aclanthology.org/2021.tal-3.2

Kevers, L. (2006). L’information biographique : modélisation, extraction et organisation en base de connaissances. In P. Mertens, C. Fairon, A. Dister, & P. Watrin (Eds.), Actes de la 13ème conférence sur le Traitement Automatique des Langues Naturelles. REncontres jeunes Chercheurs en Informatique pour le Traitement Automatique des Langues (pp. 680–689). ATALA. https://aclanthology.org/2006.jeptalnrecital-recital.4

Extracting and exploiting the biographical data contained in press dispatches is a complex process. To approach it properly, a complete, precise, and functional definition of this information is needed. Yet the difficulty encountered during the preliminary analysis of the extraction task lies in the absence of such a definition. We propose conventions here with the aim of developing one. The main concept used to express it is the structuring of information as (subject, relation, object) triples. The partial definition built in this way is used during the information extraction step, which relies on finite-state transducers. It also suggests an implementation approach for organizing the extracted data into a knowledge base.

Millour, A., Brasile, L., Ghia, A., & Kevers, L. (2024). Agettivu, Aggitivu o Aghjettivu? POS Tagging Corsican Dialects. In N. Calzolari, M.-Y. Kan, V. Hoste, A. Lenci, S. Sakti, & N. Xue (Eds.), Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024) (pp. 600–608). ELRA and ICCL. https://aclanthology.org/2024.lrec-main.52

In this paper we present a series of experiments towards POS tagging Corsican, a less-resourced language spoken in Corsica and linguistically related to Italian. The first contribution is Corsican-POS, the first gold standard POS-tagged corpus for Corsican, composed of 500 sentences manually annotated with the Universal POS tagset. Our second contribution is a set of experiments and evaluation of POS tagging models which starts with a baseline model for Italian and is aimed at finding the best training configuration, namely in terms of the size and combination strategy of the existing raw and annotated resources. These experiments result in (i) the first POS tagger for Corsican, reaching an accuracy of 93.38%, (ii) a quantification of the gain provided by the use of each available resource. We find that the optimal configuration uses Italian word embeddings further specialized with Corsican embeddings and trained on the largest gold corpus for Corsican available so far.

Stosic, D., Marjanović, S., Bernhard, D., Bras, M., Kevers, L., Retali-Medori, S., Vergez-Couret, M., & Werner, C. (2024). The ParCoLab Parallel Corpus and Its Extension to Four Regional Languages of France. In N. Calzolari, M.-Y. Kan, V. Hoste, A. Lenci, S. Sakti, & N. Xue (Eds.), Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024) (pp. 16014–16023). ELRA and ICCL. https://aclanthology.org/2024.lrec-main.1392

Parallel corpora are still scarce for most of the world's language pairs. The situation is by no means different for regional languages of France. In addition, adequate web interfaces facilitate and encourage the use of parallel corpora by target users, such as language learners and teachers, as well as linguists. In this paper, we describe ParCoLab, a parallel corpus and a web platform for querying the corpus. From its onset, ParCoLab has been geared towards lower-resource languages, with an initial corpus in Serbian, along with French and English (later Spanish). We focus here on the extension of ParCoLab with a parallel corpus for four regional languages of France: Alsatian, Corsican, Occitan and Poitevin-Saintongeais. In particular, we detail criteria for choosing texts and issues related to their collection. The new parallel corpus contains more than 20k tokens per regional language.

Last updated from the database: 23/06/2025 15:08 (UTC)
