Machine Translation from Standard German to Alemannic Dialects

Resource Type
Conference Paper
Authors/Contributors
Lambrecht, L.; Schneider, F.; Waibel, A.
Title
Machine Translation from Standard German to Alemannic Dialects
Abstract
Machine translation has been researched using deep neural networks in recent years. These networks require lots of data to learn abstract representations of the input stored in continuous vectors. Dialect translation has become more important since the advent of social media. In particular, when dialect speakers and standard language speakers no longer understand each other, machine translation is of rising concern. Usually, dialect translation is a typical low-resourced language setting facing data scarcity problems. Additionally, spelling inconsistencies due to varying pronunciations and the lack of spelling rules complicate translation. This paper presents the best-performing approaches to handle these problems for Alemannic dialects. The results show that back-translation and conditioning on dialectal manifestations achieve the most remarkable enhancement over the baseline. Using back-translation, a significant gain of +4.5 over the strong transformer baseline of 37.3 BLEU points is accomplished. Differentiating between several Alemannic dialects instead of treating Alemannic as one dialect leads to substantial improvements: Multi-dialectal translation surpasses the baseline on the dialectal test sets. However, training individual models outperforms the multi-dialectal approach. There, improvements range from 7.5 to 10.6 BLEU points over the baseline depending on the dialect.
Date
2022-06
Proceedings Title
Proceedings of the 1st Annual Meeting of the ELRA/ISCA Special Interest Group on Under-Resourced Languages
Conference Name
SIGUL 2022
Place
Marseille, France
Publisher
European Language Resources Association
Pages
129–136
Accessed
15/05/2024 08:38
Library Catalog
ACLWeb
Notes

Summary

Related work:

  • BPE
  • transfer learning
  • back-translation
  • multilingual MT
  • RBMT / SMT / NMT - sometimes mixed

Alemannic dialects -> the full range is covered (not only Swiss German):

  • Markgräflerisch
  • Baseldeutsch (Basel German)
  • Schwäbisch (Swabian)
  • Oberalemannisch (High Alemannic)
  • Niederalemannisch (Low Alemannic)
  • Höchstalemannisch (Highest Alemannic)
  • Elsässisch (Alsatian)
  • others
  • not classified

Prior work mostly translates from the dialect into the standard language

Evaluation problems: BLEU requires exact word matches, which is problematic given the lack of an orthographic standard (the same dialect word may be spelled in several ways)

  • LCSR = longest common subsequence ratio -> but it rewards output that stays very close to the source text (see the sketch below)
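
A minimal sketch of LCSR, assuming a character-level longest common subsequence normalized by the length of the longer string (normalization conventions vary in the literature):

```python
def lcs_length(a: str, b: str) -> int:
    """Character-level longest common subsequence via dynamic programming."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ca in enumerate(a, 1):
        for j, cb in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if ca == cb else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]


def lcsr(hypothesis: str, reference: str) -> float:
    """LCS ratio in [0, 1]; 1.0 means the strings are identical."""
    if not hypothesis and not reference:
        return 1.0
    return lcs_length(hypothesis, reference) / max(len(hypothesis), len(reference))


# Because Standard German and Alemannic share much surface material, simply
# copying the source sentence already scores high -- the weakness noted above.
print(round(lcsr("das isch e Huus", "das ist ein Haus"), 2))
```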

Data:

  • bilingual: manually sentence-aligned sentences from an old Alemannic Wikipedia dump -> 16k sentence pairs
  • monolingual: curated full Alemannic Wikipedia

    • Wikipedia lets users tag articles with their local dialect -> this metadata is kept and used to group the dataset (-> 29,000 in Alsatian)
    • removed intersection with the bilingual corpus
    • -> 522k sentences (or 390k ?)
  • train/dev/test split: stratified by dialect -> still lacks test data for the underrepresented dialects (see the split sketch below)
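
A minimal sketch of a dialect-stratified split with scikit-learn; variable names and split fractions are hypothetical placeholders, not the paper's values:

```python
from sklearn.model_selection import train_test_split

def stratified_split(sentences, dialects, dev_frac=0.05, test_frac=0.05, seed=42):
    """Split so every subset keeps the corpus-level dialect proportions."""
    # Carve out the test set first, preserving the dialect distribution ...
    train_s, test_s, train_d, test_d = train_test_split(
        sentences, dialects, test_size=test_frac, stratify=dialects, random_state=seed)
    # ... then split dev off the remainder, again stratified by dialect.
    train_s, dev_s, train_d, dev_d = train_test_split(
        train_s, train_d, test_size=dev_frac / (1 - test_frac),
        stratify=train_d, random_state=seed)
    return (train_s, train_d), (dev_s, dev_d), (test_s, test_d)

# Note: proportions are preserved, so a dialect with only a handful of
# sentences still ends up with almost no test data -- the gap noted above.
```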

Dialect classifier:

  • aimed at classifying the Wikipedia articles that carry no dialect tag
  • preprocessing: articles split into chunks of 6 paragraphs / 250 tokens -> ~22K labelled samples
  • fine-tuning of a pre-trained RoBERTa for 10 epochs (Fairseq) -> 97.80% accuracy
  • prediction on the chunked articles + absolute majority voting over the chunks; articles without an absolute majority are removed from the corpus (see the voting sketch below)
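
A minimal sketch of the absolute-majority vote over chunk-level predictions; `classify_chunk` stands in for the fine-tuned RoBERTa classifier and is hypothetical:

```python
from collections import Counter
from typing import Callable, Optional

def article_dialect(chunks: list[str],
                    classify_chunk: Callable[[str], str]) -> Optional[str]:
    """Label an article by voting over its chunks.

    Returns the winning dialect only if it holds an absolute majority
    (more than half of all chunk votes); otherwise None, i.e. the
    article is dropped from the corpus.
    """
    if not chunks:
        return None
    votes = Counter(classify_chunk(chunk) for chunk in chunks)
    dialect, count = votes.most_common(1)[0]
    return dialect if count > len(chunks) / 2 else None
```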

Tokenization: 8340 BPE-generated subwords
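
The notes do not say which BPE implementation was used; here is a sketch with the sentencepiece library as one common choice, targeting the reported vocabulary size (the file names, and the assumption of a joint German/Alemannic vocabulary, are mine):

```python
import sentencepiece as spm

# Learn a BPE vocabulary of 8,340 subwords over the training text.
spm.SentencePieceTrainer.train(
    input="train.de,train.als",   # hypothetical file names
    model_prefix="bpe_de_als",
    vocab_size=8340,
    model_type="bpe",
)

sp = spm.SentencePieceProcessor(model_file="bpe_de_als.model")
print(sp.encode("Das ist ein Haus.", out_type=str))
```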

MT model architecture: Transformer

  • baseline = transformer trained on parallel corpus

    • increase dropout rates
  • Exp1: back-translation (see the pipeline sketch after this list)

    • train an LM on Standard German
    • train an MT model Alemannic -> Standard German on the parallel training data (BLEU 55.3)
    • machine-translate the monolingual training dataset into Standard German
    • train an MT model Standard German -> Alemannic on the back-translated training data
    • fine-tune the DE -> dialect model on the genuine parallel training data
    • Results:

      • overall gain of +4.5 BLEU points (over the 37.3 BLEU baseline),
      • yet the effect varies a lot across dialects (mostly decreases),
      • large increase for Alsatian, but biased: the Alsatian data consists mostly of quasi-identical articles on Alsatian municipalities
  • Exp2: individual models per dialect

    • 3 MT models for 3 dialects: Margravian (Markgräflerisch), Basel German, Swabian
    • trained on the full parallel training data ??? (unclear in the paper)
    • fine-tuned on the dialect-specific parallel training data
    • (maybe they mean pretraining on monolingual dialect data, then training on the parallel data?)
    • Results: improvements of +7.5 to +10.6 BLEU points over the baseline, depending on the dialect (per the abstract)
  • Exp3: multi-dialectal (i.e. multilingual) model (see the multi-decoder sketch after this list)

    • transformer with 1 shared encoder (Standard German) + 5 decoders (one per dialect)
    • 103 epochs of training, then fine-tuning for 10 epochs (= pretraining, then training on the parallel data??)
    • better results when embeddings and decoders are not shared
    • Results:

      • improvements over the baseline, yet worse than the separate models, except for the lowest-resourced dialect, Swabian
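
A minimal sketch of the back-translation data generation step from Exp1, using fairseq's model-loading interface; checkpoint and file paths are hypothetical:

```python
from fairseq.models.transformer import TransformerModel

# Reverse model: Alemannic -> Standard German, trained on the parallel data.
reverse = TransformerModel.from_pretrained(
    "checkpoints/als-de",                 # hypothetical checkpoint directory
    checkpoint_file="checkpoint_best.pt",
    data_name_or_path="data-bin/als-de",
)

# Pair each authentic monolingual Alemannic sentence with a synthetic
# Standard German source produced by the reverse model.
with open("mono.als") as f_in, \
        open("bt.de", "w") as f_src, open("bt.als", "w") as f_tgt:
    for line in f_in:
        als = line.strip()
        f_src.write(reverse.translate(als) + "\n")
        f_tgt.write(als + "\n")

# A DE -> Alemannic model is then trained on (bt.de, bt.als) and finally
# fine-tuned on the genuine parallel data.
```

And a sketch of the multi-dialectal architecture from Exp3: one shared Standard German encoder, one decoder (and embedding/output layer) per dialect. This illustrates the idea described in the notes, not the authors' code; positional encodings and attention masks are omitted for brevity:

```python
import torch.nn as nn

class MultiDialectTransformer(nn.Module):
    def __init__(self, dialects, d_model=512, nhead=8, layers=6, vocab=8340):
        super().__init__()
        self.src_embed = nn.Embedding(vocab, d_model)
        self.encoder = nn.TransformerEncoder(               # shared encoder
            nn.TransformerEncoderLayer(d_model, nhead, batch_first=True), layers)
        # Unshared per-dialect parts (the notes report better results this way).
        self.tgt_embed = nn.ModuleDict(
            {d: nn.Embedding(vocab, d_model) for d in dialects})
        self.decoders = nn.ModuleDict({
            d: nn.TransformerDecoder(
                nn.TransformerDecoderLayer(d_model, nhead, batch_first=True), layers)
            for d in dialects})
        self.out_proj = nn.ModuleDict(
            {d: nn.Linear(d_model, vocab) for d in dialects})

    def forward(self, src_ids, tgt_ids, dialect: str):
        memory = self.encoder(self.src_embed(src_ids))
        hidden = self.decoders[dialect](self.tgt_embed[dialect](tgt_ids), memory)
        return self.out_proj[dialect](hidden)   # per-dialect vocabulary logits
```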

Evaluation:

  • BLEU on the global test set + per-dialect test sets (see the sketch below)
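
A minimal sketch of the per-dialect BLEU evaluation with sacrebleu; variable names are hypothetical:

```python
from collections import defaultdict
import sacrebleu

def per_dialect_bleu(hypotheses, references, dialects):
    """BLEU on the full test set plus one score per dialect subset."""
    scores = {"all": sacrebleu.corpus_bleu(hypotheses, [references]).score}
    groups = defaultdict(lambda: ([], []))
    for hyp, ref, dia in zip(hypotheses, references, dialects):
        groups[dia][0].append(hyp)
        groups[dia][1].append(ref)
    for dia, (hyps, refs) in groups.items():
        scores[dia] = sacrebleu.corpus_bleu(hyps, [refs]).score
    return scores
```
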
Reference
Lambrecht, L., Schneider, F., & Waibel, A. (2022). Machine Translation from Standard German to Alemannic Dialects. In M. Melero, S. Sakti, & C. Soria (Eds.), Proceedings of the 1st Annual Meeting of the ELRA/ISCA Special Interest Group on Under-Resourced Languages (pp. 129–136). European Language Resources Association. https://aclanthology.org/2022.sigul-1.17