COLAF
I Am Too Old For This Style!: A Stylometric Benchmark of Age Effect on Authorship Attribution
Can age act as a significant factor in determining authorship? While writers’ styles are known to measurably evolve over time, most computational pipelines still treat them as time-invariant. Building on recent work that tracks idiolectal change in several literary traditions, we introduce the first controlled benchmark that quantifies how both the magnitude and the direction of temporal distance affect attribution. A diachronic corpus of French novelists is sampled at bidirectional gaps of ±1, ±5, ±10, and ±15 years. Verification is then tested with a generative Bootstrap Distance Impostors (BDI) model and a discriminative linear Support Vector Machine (SVM). Results reveal a near-linear loss of confidence as the gap widens but, critically, also show a systematic directional asymmetry: late-career texts remain recognisable when compared with early-career references, whereas early texts are markedly harder to verify against late ones, suggesting a cumulative rather than a substitutive evolution. The pattern persists under random-pair controls, confirming that it is not a sampling artefact. All code, data, and evaluation scripts are released to encourage further research on temporally robust authorship analysis and to quantify both the dynamics and amplitude of idiolectal change over time.
Florian Cafiero, Lucence Ing, Simon Gabay, Thibault Clérice
Nov. 21, 2025
PDF
Hal
A French Version of the OLDI Seed Corpus
We present the first French partition of the OLDI Seed Corpus, our submission to the WMT 2025 Open Language Data Initiative (OLDI) shared task. We detail its creation process, which involved using multiple machine translation systems and a custom-built interface for post-editing by qualified native speakers. We also highlight the unique translation challenges presented by the source data, which combines highly technical, encyclopedic terminology with the stylistic irregularities characteristic of user-generated content taken from Wikipedia. This French corpus is not an end in itself, but is intended as a crucial pivot resource to facilitate the collection of parallel corpora for the under-resourced regional languages of France.
Malik Marmonier, Benoît Sagot, Rachel Bawden
Nov. 20, 2025
PDF
Hal
Towards Skeletal and Signer Noise Reduction in Sign Language Production via Quaternion-Based Pose Encoding and Contrastive Learning
One of the main challenges in neural sign language production (SLP) lies in the high intra-class variability of signs, arising from signer morphology and stylistic variety in the training data. To improve robustness to such variations, we propose two enhancements to the standard Progressive Transformers (PT) architecture (Saunders et al., 2020). First, we encode poses using bone rotations in quaternion space and train with a geodesic loss to improve the accuracy and clarity of angular joint movements. Second, we introduce a contrastive loss to structure decoder embeddings by semantic similarity, using either gloss overlap or SBERT-based sentence similarity, aiming to filter out anatomical and stylistic features that do not convey relevant semantic information. On the Phoenix14T dataset, the contrastive loss alone yields an improvement in Probability of Correct Keypoint over the PT baseline. When combined with quaternion-based pose encoding, the model achieves a reduction in Mean Bone Angle Error. These results point to the benefit of incorporating skeletal structure modeling and semantically guided contrastive objectives on sign pose representations into the training of Transformer-based SLP models.
Guilhem Faure, Mostafa Sadeghi, Sam Bigeard, Slim Ouni
Sept. 30, 2025
PDF
Hal
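The geodesic loss on quaternion bone rotations mentioned in the abstract above can be illustrated with a minimal sketch. The abstract does not give the exact formulation, so this uses one common definition of geodesic distance between unit quaternions (the rotation angle separating them); the function names and the mean reduction over bones are assumptions, not the authors' implementation.

```python
import math


def normalize(q):
    """Scale a quaternion (w, x, y, z) to unit length."""
    norm = math.sqrt(sum(c * c for c in q))
    if norm == 0.0:
        raise ValueError("a zero quaternion does not encode a rotation")
    return tuple(c / norm for c in q)


def geodesic_distance(q1, q2):
    """Rotation angle, in radians, separating two quaternions."""
    a, b = normalize(q1), normalize(q2)
    # q and -q encode the same rotation, so the sign of the dot
    # product is discarded before taking the arc cosine.
    dot = abs(sum(x * y for x, y in zip(a, b)))
    # Clamp to guard against rounding slightly above 1.0.
    dot = min(1.0, dot)
    return 2.0 * math.acos(dot)


def geodesic_loss(predicted, target):
    """Mean geodesic distance over the bones of one pose (assumed reduction)."""
    if len(predicted) != len(target) or not predicted:
        raise ValueError("poses must be non-empty and have the same bone count")
    total = sum(geodesic_distance(p, t) for p, t in zip(predicted, target))
    return total / len(predicted)


identity = (1.0, 0.0, 0.0, 0.0)
# Quarter turn about the z axis.
quarter_turn_z = (math.cos(math.pi / 4), 0.0, 0.0, math.sin(math.pi / 4))
loss = geodesic_loss([identity, identity], [identity, quarter_turn_z])
```

Unlike a Euclidean loss on raw quaternion components, this distance treats `q` and `-q` as the same rotation, which is the usual motivation for a geodesic formulation.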
Preprocessing MediaPipe Joint Annotation for Sign Language Similarity Analysis
This paper introduces a preprocessing pipeline for keypoints extracted using MediaPipe, aiming to improve pose annotation consistency in sign language datasets. We evaluate its effectiveness using a sign similarity task based on phonological features, without relying on gloss annotations. Similarity is measured using Dynamic Time Warping (DTW) across videos from sign language dictionaries. Although such similarity analyses can support various sign language processing applications (such as lexical search, clustering, and data enrichment), the main contribution of this work is to standardise pose features across heterogeneous sources, including different signers and backgrounds. Experiments on two dictionary datasets show that our pipeline significantly improves similarity measurements, with promising benefits for other sign language processing tasks.
Kehina Manseri, Sam Bigeard, Slim Ouni
Sept. 16, 2025
PDF
Hal
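The DTW-based similarity measurement described in the abstract above can be sketched as follows. This is the textbook dynamic-programming form of Dynamic Time Warping with a Euclidean frame distance; the frame representation (one flat tuple of keypoint coordinates per video frame) and the function name are assumptions, and the paper's preprocessing steps are not reproduced here.

```python
import math


def dtw_distance(seq_a, seq_b):
    """Cumulative DTW cost between two sequences of feature vectors.

    Each sequence is a list of frames; each frame is a tuple of floats
    (for example, flattened keypoint coordinates). Frames must all have
    the same dimension.
    """
    n, m = len(seq_a), len(seq_b)
    if n == 0 or m == 0:
        raise ValueError("both sequences must contain at least one frame")

    # cost[i][j] holds the cheapest alignment of the first i frames of
    # seq_a with the first j frames of seq_b.
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            frame_dist = math.dist(seq_a[i - 1], seq_b[j - 1])
            cost[i][j] = frame_dist + min(
                cost[i - 1][j],      # seq_b frame j is held while seq_a advances
                cost[i][j - 1],      # seq_a frame i is held while seq_b advances
                cost[i - 1][j - 1],  # both sequences advance together
            )
    return cost[n][m]


sign = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
# Same movement, but the first frame is held for longer.
slower_sign = [(0.0, 0.0), (0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
different_sign = [(0.0, 0.0), (1.0, 1.0), (2.0, 3.0)]
```

Because the warping path may repeat frames, two recordings of the same movement at different speeds score a cost of zero, while a change in the trajectory itself raises the cost.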
La traduction automatique dialectale: état de l'art et étude préliminaire sur le continuum dialectal de l'occitan (Dialectal machine translation: state of the art and a preliminary study on the Occitan dialect continuum)
This article surveys the state of the art in machine translation and its evaluation for languages with dialectal variation, and for dialect continua in particular. To illustrate this survey, we propose a series of preliminary experiments on the Occitan continuum, in order to assess the performance of existing systems for translation from and into several varieties of Occitan. Our results indicate, on the one hand, broadly satisfactory performance for translation into French and English. On the other hand, analyses combined with language identification tools applied to the predictions into Occitan highlight the ability of most evaluated systems to generate text in this language (including in zero-shot settings), but also reveal limitations in evaluating dialectal diversity in the proposed translations.
Oriane Nédey
June 30, 2025
PDF
Hal
COLaF: Corpus et Outils pour les Langues de France et variétés de français (Corpus and Tools for the Languages of France and Varieties of French)
Benoît Sagot, Slim Ouni, Sam Bigeard, Lucence Ing, Rasul Dent, Juliette Janès, Thibault Clérice, Rachel Bawden, Emmanuel Vincent, Oriane Nédey, Malek Yaich, Panagiotis Tsolakis, Vincent Colotte, Mostafa Sadeghi
June 4, 2025
PDF
Hal
Retour d'expérience: Whisper pour les langues régionales (Lessons learned: Whisper for regional languages)
Our goal is to develop an automatic speech recognition (ASR) system for regional languages. To this end, we explore specialising or adapting Whisper through fine-tuning. In this article, we report on work in progress on two languages: Basque and Alsatian.
Sam Bigeard, Panagiotis Tsolakis, Emmanuel Vincent, Vincent Colotte, Pascale Erhart, Slim Ouni
Nov. 17, 2024
PDF
Hal
The birth of French orthography. A computational analysis of French spelling systems in diachrony
The 17th century is crucial for the French language, as it sees the creation of a strict orthographic norm that largely persists to this day. Despite its importance, the history of spelling systems nevertheless remains a neglected area in linguistics, for two reasons. On the one hand, spelling consists of micro-changes that require a quantitative approach; on the other hand, no corpus is available, owing to editorial interventions in almost all the texts already accessible. In this article, we therefore propose a new corpus enabling such a study, along with the extraction and analysis tools needed for our research. By comparing the text extracted by OCR with a version automatically aligned to contemporary French spelling, we extract the zones of variation, categorise these variants, and study their frequency in order to analyse (ortho)graphic change over the course of the 17th century.
Simon Gabay, Thibault Clérice
Sept. 21, 2024
PDF
Hal
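The comparison step described in the abstract above (OCR text set against a version aligned to contemporary spelling, with variant zones extracted and counted) can be illustrated with a small sketch. The abstract does not name the alignment tooling, so Python's standard `difflib.SequenceMatcher` stands in for it here; the function name, the whitespace tokenisation, and the sample sentence are all hypothetical.

```python
from collections import Counter
from difflib import SequenceMatcher


def extract_variants(original_tokens, normalised_tokens):
    """Return (original, normalised) pairs wherever the two versions differ."""
    matcher = SequenceMatcher(
        a=original_tokens, b=normalised_tokens, autojunk=False
    )
    variants = []
    for tag, a_start, a_end, b_start, b_end in matcher.get_opcodes():
        if tag == "equal":
            continue
        source = original_tokens[a_start:a_end]
        target = normalised_tokens[b_start:b_end]
        if tag == "replace" and len(source) == len(target):
            # Same number of tokens on both sides: pair them one to one.
            variants.extend(zip(source, target))
        else:
            # Insertions, deletions, or uneven replacements are kept
            # as a single multi-token zone.
            variants.append((" ".join(source), " ".join(target)))
    return variants


# Hypothetical sample: an early modern spelling and its modern counterpart.
ocr_tokens = "il estoit vn roy".split()
modern_tokens = "il était un roi".split()

variants = extract_variants(ocr_tokens, modern_tokens)
frequencies = Counter(variants)
```

Counting the resulting pairs per text or per decade is one simple way to turn the extracted zones into the frequency data that a diachronic analysis needs.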
Molyé: A Corpus-based Approach to Language Contact in Colonial France
Whether or not several Creole languages which developed during the early modern period can be considered genetic descendants of European languages has been the subject of intense debate. This is in large part due to the absence of evidence of intermediate forms. This work introduces a new open corpus, the Molyé corpus, which combines stereotypical representations of three kinds of language variation in Europe with early attestations of French-based Creole languages across a period of 400 years. It is intended to facilitate future research on the continuity between contact situations in Europe and Creolophone (former) colonies.
Rasul Dent, Juliette Janès, Thibault Clérice, Pedro Ortiz Suarez, Benoît Sagot
Aug. 8, 2024
PDF
arXiv