COLaF
I Am Too Old For This Style!: A Stylometric Benchmark of Age Effect on Authorship Attribution
Can age act as a significant factor in determining authorship? While writers’ styles are known to measurably evolve over time, most computational pipelines still treat them as time-invariant. Building on recent work that tracks idiolectal change in several literary traditions, we introduce the first controlled benchmark that quantifies how both the magnitude and the direction of temporal distance affect attribution. A diachronic corpus of French novelists is sampled at bidirectional gaps of ±1, ±5, ±10, and ±15 years. Verification is then tested with a generative Bootstrap Distance Impostors (BDI) model and a discriminative linear Support Vector Machine (SVM). Results reveal a near-linear loss of confidence as the gap widens but, critically, also show a systematic directional asymmetry: late-career texts remain recognisable when compared with early-career references, whereas early texts are markedly harder to verify against late ones, suggesting a cumulative rather than a substitutive evolution. The pattern persists under random-pair controls, confirming that it is not a sampling artefact. All code, data, and evaluation scripts are released to encourage further research on temporally robust authorship analysis and to quantify both the dynamics and amplitude of idiolectal change over time.
Florian Cafiero, Lucence Ing, Simon Gabay, Thibault Clérice
Nov 21, 2025
PDF
Hal
A French Version of the OLDI Seed Corpus
We present the first French partition of the OLDI Seed Corpus, our submission to the WMT 2025 Open Language Data Initiative (OLDI) shared task. We detail its creation process, which involved using multiple machine translation systems and a custom-built interface for post-editing by qualified native speakers. We also highlight the unique translation challenges presented by the source data, which combines highly technical, encyclopedic terminology with the stylistic irregularities characteristic of user-generated content taken from Wikipedia. This French corpus is not an end in itself, but is intended as a crucial pivot resource to facilitate the collection of parallel corpora for the under-resourced regional languages of France.
Malik Marmonier, Benoît Sagot, Rachel Bawden
Nov 20, 2025
PDF
Hal
Towards Skeletal and Signer Noise Reduction in Sign Language Production via Quaternion-Based Pose Encoding and Contrastive Learning
One of the main challenges in neural sign language production (SLP) lies in the high intra-class variability of signs, arising from signer morphology and stylistic variety in the training data. To improve robustness to such variations, we propose two enhancements to the standard Progressive Transformers (PT) architecture (Saunders et al., 2020). First, we encode poses using bone rotations in quaternion space and train with a geodesic loss to improve the accuracy and clarity of angular joint movements. Second, we introduce a contrastive loss to structure decoder embeddings by semantic similarity, using either gloss overlap or SBERT-based sentence similarity, aiming to filter out anatomical and stylistic features that do not convey relevant semantic information. On the Phoenix14T dataset, the contrastive loss alone yields an improvement in Probability of Correct Keypoint over the PT baseline. When combined with quaternion-based pose encoding, the model achieves a reduction in Mean Bone Angle Error. These results point to the benefit of incorporating skeletal structure modeling and semantically guided contrastive objectives on sign pose representations into the training of Transformer-based SLP models.
Guilhem Faure, Mostafa Sadeghi, Sam Bigeard, Slim Ouni
Sep 30, 2025
PDF
Hal
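As an illustration of the geodesic loss mentioned in the abstract of the quaternion-based pose encoding paper, here is a minimal sketch of one common definition of geodesic distance between two rotations expressed as quaternions. The abstract does not give the exact formulation, so the function names (`normalize`, `geodesic_angle`) and the choice of formula are assumptions, not the authors' implementation.

```python
import math


def normalize(q):
    """Scale a quaternion (w, x, y, z) to unit length."""
    norm = math.sqrt(sum(c * c for c in q))
    return tuple(c / norm for c in q)


def geodesic_angle(q1, q2):
    """Angle in radians of the rotation taking q1 to q2.

    Assumed formula: theta = 2 * acos(|<q1, q2>|). The absolute value
    accounts for q and -q encoding the same rotation.
    """
    q1, q2 = normalize(q1), normalize(q2)
    dot = abs(sum(a * b for a, b in zip(q1, q2)))
    dot = min(1.0, dot)  # guard against rounding slightly above 1
    return 2.0 * math.acos(dot)


# Identity versus a 90-degree rotation about the z axis.
identity = (1.0, 0.0, 0.0, 0.0)
quarter_turn_z = (math.cos(math.pi / 4), 0.0, 0.0, math.sin(math.pi / 4))
angle = geodesic_angle(identity, quarter_turn_z)
```

A training loss along these lines would typically average such angles over all bones and frames; how the paper aggregates them is not stated in the abstract.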
Preprocessing MediaPipe Joint Annotation for Sign Language Similarity Analysis
This paper introduces a preprocessing pipeline for keypoints extracted using MediaPipe, aiming to improve pose annotation consistency in sign language datasets. We evaluate its effectiveness using a sign similarity task based on phonological features, without relying on gloss annotations. Similarity is measured using Dynamic Time Warping (DTW) across videos from sign language dictionaries. Although such similarity analyses can support various sign language processing applications (such as lexical search, clustering, and data enrichment), the main contribution of this work is to standardise pose features across heterogeneous sources, including different signers and backgrounds. Experiments on two dictionary datasets show that our pipeline significantly improves similarity measurements, with promising benefits for other sign language processing tasks.
Kehina Manseri, Sam Bigeard, Slim Ouni
Sep 16, 2025
PDF
Hal
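The MediaPipe preprocessing paper above measures sign similarity with Dynamic Time Warping (DTW). For readers unfamiliar with the technique, here is a minimal sketch of textbook DTW over two sequences of per-frame feature vectors. The function name `dtw_distance`, the Euclidean frame distance, and the toy inputs are assumptions made for illustration; the paper's actual features and DTW variant are not described in the abstract.

```python
import math


def dtw_distance(seq_a, seq_b):
    """Cumulative DTW cost between two sequences of equal-dimension vectors.

    Each element stands for one frame's feature vector (hypothetically,
    flattened keypoint coordinates). Frame distance is Euclidean.
    """
    n, m = len(seq_a), len(seq_b)
    # cost[i][j] = best cost of aligning the first i frames of seq_a
    # with the first j frames of seq_b.
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(seq_a[i - 1], seq_b[j - 1])
            cost[i][j] = d + min(
                cost[i - 1][j],      # repeat a frame of seq_b
                cost[i][j - 1],      # repeat a frame of seq_a
                cost[i - 1][j - 1],  # advance both sequences
            )
    return cost[n][m]


# Same one-dimensional trajectory at two speeds: the warped cost is zero.
slow = [(0.0,), (1.0,), (1.0,), (2.0,)]
fast = [(0.0,), (1.0,), (2.0,)]
same_sign_cost = dtw_distance(fast, slow)
```

Because DTW tolerates differences in signing speed but not in scale or position, consistent pose features across signers and recordings matter for the comparison, which is the motivation the abstract gives for the preprocessing pipeline.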
Dialectal machine translation: survey and preliminary study on the Occitan dialect continuum
We survey the state of the art in machine translation and its evaluation for languages with dialectal variation, and in particular for dialect continua. To accompany this overview, we propose a series of preliminary experiments on the Occitan continuum, in order to assess the performance of existing systems when translating from and into several varieties of Occitan. Our results indicate that translation into French and English is generally of good quality. Analyses combined with language identification tools applied to predictions into Occitan highlight the ability of most of the systems tested to generate texts in this language (even in zero-shot settings), but they also reveal limitations in the methods available for evaluating the dialectal diversity of the proposed translations.
Oriane Nédey
Jun 30, 2025
PDF
Hal
COLaF : Corpus et Outils pour les Langues de France et variétés de français
Benoît Sagot, Slim Ouni, Sam Bigeard, Lucence Ing, Rasul Dent, Juliette Janès, Thibault Clérice, Rachel Bawden, Emmanuel Vincent, Oriane Nédey, Malek Yaich, Panagiotis Tsolakis, Vincent Colotte, Mostafa Sadeghi
Jun 4, 2025
PDF
Hal
Feedbacks: Whisper for regional languages
Our goal is to develop an automatic speech recognition (ASR) system for regional languages. To achieve this, we are exploring the specialization or adaptation of Whisper through fine-tuning. In this article, we report feedback from ongoing work on two languages: Basque and Alsatian.
Sam Bigeard, Panagiotis Tsolakis, Emmanuel Vincent, Vincent Colotte, Pascale Erhart, Slim Ouni
Nov 17, 2024
PDF
Hal
The birth of French orthography. A computational analysis of French spelling systems in diachrony
The 17th century is crucial for the French language, as it sees the creation of a strict orthographic norm that largely persists to this day. Despite its significance, the history of spelling systems remains an overlooked area in linguistics, for two reasons. On the one hand, spelling change is made up of micro-changes, which require a quantitative approach; on the other hand, no suitable corpus is available, because editors have intervened in almost all the texts already available. In this paper, we therefore propose a new corpus allowing such a study, as well as the extraction and analysis tools necessary for our research. By comparing the text extracted with OCR and a version automatically aligned with contemporary French spelling, we extract the variant zones, categorise these variants, and measure their frequency in order to study (ortho)graphic change during the 17th century.
Simon Gabay, Thibault Clérice
Sep 21, 2024
PDF
Hal
Molyé: A Corpus-based Approach to Language Contact in Colonial France
Whether or not several Creole languages which developed during the early modern period can be considered genetic descendants of European languages has been the subject of intense debate. This is in large part due to the absence of evidence of intermediate forms. This work introduces a new open corpus, the Molyé corpus, which combines stereotypical representations of three kinds of language variation in Europe with early attestations of French-based Creole languages across a period of 400 years. It is intended to facilitate future research on the continuity between contact situations in Europe and Creolophone (former) colonies.
Rasul Dent, Juliette Janès, Thibault Clérice, Pedro Ortiz Suarez, Benoît Sagot
Aug 8, 2024
PDF
arXiv