COLaF
Preprocessing MediaPipe Joint Annotation for Sign Language Similarity Analysis
This paper introduces a preprocessing pipeline for keypoints extracted using MediaPipe, aiming to improve pose annotation consistency in sign language datasets. We evaluate its effectiveness using a sign similarity task based on phonological features, without relying on gloss annotations. Similarity is measured using Dynamic Time Warping (DTW) across videos from sign language dictionaries. Although such similarity analyses can support various sign language processing applications - such as lexical search, clustering, and data enrichment - the main contribution of this work is to standardise pose features across heterogeneous sources, including different signers and backgrounds. Experiments on two dictionary datasets show that our pipeline significantly improves similarity measurements, with promising benefits for other sign language processing tasks.
Kehina Manseri, Sam Bigeard, Slim Ouni
Sep 16, 2025
PDF
Hal
COLaF : Corpus et Outils pour les Langues de France et variétés de français (COLaF: Corpus and Tools for the Languages of France and Varieties of French)
Benoît Sagot, Slim Ouni, Sam Bigeard, Lucence Ing, Rasul Dent, Juliette Janès, Thibault Clérice, Rachel Bawden, Emmanuel Vincent, Oriane Nédey, Malek Yaich, Panagiotis Tsolakis, Vincent Colotte, Mostafa Sadeghi
Jun 4, 2025
PDF
Hal
Feedbacks: Whisper for regional languages
Our goal is to develop an automatic speech recognition (ASR) system for regional languages. To achieve this, we are exploring the specialization or adaptation of Whisper through fine-tuning. In this article, we present our feedback on ongoing work in two languages: Basque and Alsatian.
Sam Bigeard, Panagiotis Tsolakis, Emmanuel Vincent, Vincent Colotte, Pascale Erhart, Slim Ouni
Nov 17, 2024
PDF
Hal
The birth of French orthography. A computational analysis of French spelling systems in diachrony
The 17th century is crucial for the French language, as it saw the creation of a strict orthographic norm that largely persists to this day. Despite its significance, the history of spelling systems remains an overlooked area in linguistics, for two reasons. On the one hand, spelling change consists of micro-changes whose study requires a quantitative approach; on the other hand, no suitable corpus is available, because editors have intervened in almost all the texts already available. In this paper, we therefore propose a new corpus allowing such a study, as well as the extraction and analysis tools necessary for our research. By comparing the text extracted with OCR against a version automatically aligned with contemporary French spelling, we extract the variant zones, categorise these variants, and measure their frequency to study (ortho)graphic change during the 17th century.
Simon Gabay, Thibault Clérice
Sep 21, 2024
PDF
Hal
Molyé: A Corpus-based Approach to Language Contact in Colonial France
Whether or not several Creole languages which developed during the early modern period can be considered genetic descendants of European languages has been the subject of intense debate. This is in large part due to the absence of evidence of intermediate forms. This work introduces a new open corpus, the Molyé corpus, which combines stereotypical representations of three kinds of language variation in Europe with early attestations of French-based Creole languages across a period of 400 years. It is intended to facilitate future research on the continuity between contact situations in Europe and Creolophone (former) colonies.
Rasul Dent, Juliette Janès, Thibault Clérice, Pedro Ortiz Suarez, Benoît Sagot
Aug 8, 2024
PDF
arXiv