Machine Translation from Standard German to Alemannic Dialects

Resource Type
Conference Paper
Authors/Contributors
Lambrecht, L.; Schneider, F.; Waibel, A.
Title
Machine Translation from Standard German to Alemannic Dialects
Abstract
Machine translation has been researched using deep neural networks in recent years. These networks require lots of data to learn abstract representations of the input stored in continuous vectors. Dialect translation has become more important since the advent of social media. In particular, when dialect speakers and standard language speakers no longer understand each other, machine translation is of rising concern. Usually, dialect translation is a typical low-resourced language setting facing data scarcity problems. Additionally, spelling inconsistencies due to varying pronunciations and the lack of spelling rules complicate translation. This paper presents the best-performing approaches to handle these problems for Alemannic dialects. The results show that back-translation and conditioning on dialectal manifestations achieve the most remarkable enhancement over the baseline. Using back-translation, a significant gain of +4.5 over the strong transformer baseline of 37.3 BLEU points is accomplished. Differentiating between several Alemannic dialects instead of treating Alemannic as one dialect leads to substantial improvements: Multi-dialectal translation surpasses the baseline on the dialectal test sets. However, training individual models outperforms the multi-dialectal approach. There, improvements range from 7.5 to 10.6 BLEU points over the baseline depending on the dialect.
Date
2022-06
Proceedings Title
Proceedings of the 1st Annual Meeting of the ELRA/ISCA Special Interest Group on Under-Resourced Languages
Conference Name
SIGUL 2022
Place
Marseille, France
Publisher
European Language Resources Association
Pages
129–136
Accessed
15/05/2024 08:38
Library Catalog
ACLWeb
Notes

Summary

Related work:

  • BPE
  • transfer learning
  • back-translation
  • multilingual MT
  • RBMT / SMT / NMT - sometimes mixed

Alemannic dialects -> the full range is covered (not only Swiss German):

  • Markgräflerisch
  • Baseldeutsch (Basel German)
  • Schwäbisch (Swabian)
  • Oberalemannisch (High Alemannic)
  • Niederalemannisch (Low Alemannic)
  • Höchstalemannisch (Highest Alemannic)
  • Elsässisch (Alsatian)
  • others
  • not classified

Prior work mostly translates from the dialect into the standard language

Evaluation problems: BLEU requires exact word matches, which is problematic given the lack of an orthographic standard (the same dialect word may be spelled in several ways)

  • LCSR = longest common subsequence ratio -> but it rewards output that stays very close to the source text (see the sketch below)
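
A minimal sketch of LCSR, assuming a character-level longest common subsequence normalized by the length of the longer string (normalization conventions vary in the literature):

```python
def lcs_length(a: str, b: str) -> int:
    """Character-level longest common subsequence via dynamic programming."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ca in enumerate(a, 1):
        for j, cb in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if ca == cb else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]


def lcsr(hypothesis: str, reference: str) -> float:
    """LCS ratio in [0, 1]; 1.0 means the strings are identical."""
    if not hypothesis and not reference:
        return 1.0
    return lcs_length(hypothesis, reference) / max(len(hypothesis), len(reference))


# Because Standard German and Alemannic share much surface material, simply
# copying the source sentence already scores high -- the weakness noted above.
print(round(lcsr("das isch e Huus", "das ist ein Haus"), 2))
```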

Data:

  • bilingual: manually sentence-aligned sentences from an old Alemannic Wikipedia dump -> 16k sentence pairs
  • monolingual: curated full Alemannic Wikipedia

    • Wikipedia lets users tag articles with their local dialect -> this metadata is kept and used to group the dataset (-> 29,000 in Alsatian)
    • removed intersection with the bilingual corpus
    • -> 522k sentences (or 390k ?)
  • train/dev/test split: stratified by dialect -> still lacks test data for the underrepresented dialects (see the split sketch below)
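
A minimal sketch of a dialect-stratified split with scikit-learn; variable names and split fractions are hypothetical placeholders, not the paper's values:

```python
from sklearn.model_selection import train_test_split

def stratified_split(sentences, dialects, dev_frac=0.05, test_frac=0.05, seed=42):
    """Split so every subset keeps the corpus-level dialect proportions."""
    # Carve out the test set first, preserving the dialect distribution ...
    train_s, test_s, train_d, test_d = train_test_split(
        sentences, dialects, test_size=test_frac, stratify=dialects, random_state=seed)
    # ... then split dev off the remainder, again stratified by dialect.
    train_s, dev_s, train_d, dev_d = train_test_split(
        train_s, train_d, test_size=dev_frac / (1 - test_frac),
        stratify=train_d, random_state=seed)
    return (train_s, train_d), (dev_s, dev_d), (test_s, test_d)

# Note: proportions are preserved, so a dialect with only a handful of
# sentences still ends up with almost no test data -- the gap noted above.
```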

Dialect classifier:

  • aimed at classifying the Wikipedia articles that carry no dialect tag
  • preprocessing: articles split into chunks of 6 paragraphs / 250 tokens -> ~22K labelled samples
  • fine-tuning of a pre-trained RoBERTa for 10 epochs (Fairseq) -> 97.80% accuracy
  • prediction on the chunked articles + absolute majority voting over the chunks; articles without an absolute majority are removed from the corpus (see the voting sketch below)
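
A minimal sketch of the absolute-majority vote over chunk-level predictions; `classify_chunk` stands in for the fine-tuned RoBERTa classifier and is hypothetical:

```python
from collections import Counter
from typing import Callable, Optional

def article_dialect(chunks: list[str],
                    classify_chunk: Callable[[str], str]) -> Optional[str]:
    """Label an article by voting over its chunks.

    Returns the winning dialect only if it holds an absolute majority
    (more than half of all chunk votes); otherwise None, i.e. the
    article is dropped from the corpus.
    """
    if not chunks:
        return None
    votes = Counter(classify_chunk(chunk) for chunk in chunks)
    dialect, count = votes.most_common(1)[0]
    return dialect if count > len(chunks) / 2 else None
```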

Tokenization: 8340 BPE-generated subwords
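
The notes do not say which BPE implementation was used; here is a sketch with the sentencepiece library as one common choice, targeting the reported vocabulary size (the file names, and the assumption of a joint German/Alemannic vocabulary, are mine):

```python
import sentencepiece as spm

# Learn a BPE vocabulary of 8,340 subwords over the training text.
spm.SentencePieceTrainer.train(
    input="train.de,train.als",   # hypothetical file names
    model_prefix="bpe_de_als",
    vocab_size=8340,
    model_type="bpe",
)

sp = spm.SentencePieceProcessor(model_file="bpe_de_als.model")
print(sp.encode("Das ist ein Haus.", out_type=str))
```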

MT model architecture: Transformer

  • baseline = transformer trained on parallel corpus

    • increase dropout rates
  • Exp1: back-translation (see the pipeline sketch after this list)

    • train an LM on Standard German
    • train an MT model Alemannic -> Standard German on the parallel training data (BLEU 55.3)
    • machine-translate the monolingual training dataset into Standard German
    • train an MT model Standard German -> Alemannic on the back-translated training data
    • fine-tune the DE -> dialect model on the genuine parallel training data
    • Results:

      • overall gain of +4.5 BLEU points (over the 37.3 BLEU baseline),
      • yet the effect varies a lot across dialects (mostly decreases),
      • large increase for Alsatian, but biased: the Alsatian data consists mostly of quasi-identical articles on Alsatian municipalities
  • Exp2: individual models per dialect

    • 3 MT models for 3 dialects: Margravian (Markgräflerisch), Basel German, Swabian
    • trained on the full parallel training data ??? (unclear in the paper)
    • fine-tuned on the dialect-specific parallel training data
    • (maybe they mean pretraining on monolingual dialect data, then training on the parallel data?)
    • Results: improvements of +7.5 to +10.6 BLEU points over the baseline, depending on the dialect (per the abstract)
  • Exp3: multi-dialectal (i.e. multilingual) model (see the multi-decoder sketch after this list)

    • transformer with 1 shared encoder (Standard German) + 5 decoders (one per dialect)
    • 103 epochs of training, then fine-tuning for 10 epochs (= pretraining, then training on the parallel data??)
    • better results when embeddings and decoders are not shared
    • Results:

      • improvements over the baseline, yet worse than the separate models, except for the lowest-resourced dialect, Swabian
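
A minimal sketch of the back-translation data generation step from Exp1, using fairseq's model-loading interface; checkpoint and file paths are hypothetical:

```python
from fairseq.models.transformer import TransformerModel

# Reverse model: Alemannic -> Standard German, trained on the parallel data.
reverse = TransformerModel.from_pretrained(
    "checkpoints/als-de",                 # hypothetical checkpoint directory
    checkpoint_file="checkpoint_best.pt",
    data_name_or_path="data-bin/als-de",
)

# Pair each authentic monolingual Alemannic sentence with a synthetic
# Standard German source produced by the reverse model.
with open("mono.als") as f_in, \
        open("bt.de", "w") as f_src, open("bt.als", "w") as f_tgt:
    for line in f_in:
        als = line.strip()
        f_src.write(reverse.translate(als) + "\n")
        f_tgt.write(als + "\n")

# A DE -> Alemannic model is then trained on (bt.de, bt.als) and finally
# fine-tuned on the genuine parallel data.
```

And a sketch of the multi-dialectal architecture from Exp3: one shared Standard German encoder, one decoder (and embedding/output layer) per dialect. This illustrates the idea described in the notes, not the authors' code; positional encodings and attention masks are omitted for brevity:

```python
import torch.nn as nn

class MultiDialectTransformer(nn.Module):
    def __init__(self, dialects, d_model=512, nhead=8, layers=6, vocab=8340):
        super().__init__()
        self.src_embed = nn.Embedding(vocab, d_model)
        self.encoder = nn.TransformerEncoder(               # shared encoder
            nn.TransformerEncoderLayer(d_model, nhead, batch_first=True), layers)
        # Unshared per-dialect parts (the notes report better results this way).
        self.tgt_embed = nn.ModuleDict(
            {d: nn.Embedding(vocab, d_model) for d in dialects})
        self.decoders = nn.ModuleDict({
            d: nn.TransformerDecoder(
                nn.TransformerDecoderLayer(d_model, nhead, batch_first=True), layers)
            for d in dialects})
        self.out_proj = nn.ModuleDict(
            {d: nn.Linear(d_model, vocab) for d in dialects})

    def forward(self, src_ids, tgt_ids, dialect: str):
        memory = self.encoder(self.src_embed(src_ids))
        hidden = self.decoders[dialect](self.tgt_embed[dialect](tgt_ids), memory)
        return self.out_proj[dialect](hidden)   # per-dialect vocabulary logits
```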

Evaluation:

  • BLEU on the global test set + per-dialect test sets (see the sketch below)
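
A minimal sketch of the per-dialect BLEU evaluation with sacrebleu; variable names are hypothetical:

```python
from collections import defaultdict
import sacrebleu

def per_dialect_bleu(hypotheses, references, dialects):
    """BLEU on the full test set plus one score per dialect subset."""
    scores = {"all": sacrebleu.corpus_bleu(hypotheses, [references]).score}
    groups = defaultdict(lambda: ([], []))
    for hyp, ref, dia in zip(hypotheses, references, dialects):
        groups[dia][0].append(hyp)
        groups[dia][1].append(ref)
    for dia, (hyps, refs) in groups.items():
        scores[dia] = sacrebleu.corpus_bleu(hyps, [refs]).score
    return scores
```
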
Reference
Lambrecht, L., Schneider, F., & Waibel, A. (2022). Machine Translation from Standard German to Alemannic Dialects. In M. Melero, S. Sakti, & C. Soria (Eds.), Proceedings of the 1st Annual Meeting of the ELRA/ISCA Special Interest Group on Under-Resourced Languages (pp. 129–136). European Language Resources Association. https://aclanthology.org/2022.sigul-1.17