OpenSubtitles2016: Extracting Large Parallel Corpora from Movie and TV Subtitles

Lison, Pierre; Tiedemann, Jörg

Resource type

Conference Paper

Authors/contributors

Lison, Pierre (Author)
Tiedemann, Jörg (Author)
Calzolari, Nicoletta (Editor)
Choukri, Khalid (Editor)
Declerck, Thierry (Editor)
Goggi, Sara (Editor)
Grobelnik, Marko (Editor)
Maegaard, Bente (Editor)
Mariani, Joseph (Editor)
Mazo, Helene (Editor)
Moreno, Asuncion (Editor)
Odijk, Jan (Editor)
Piperidis, Stelios (Editor)

Title

OpenSubtitles2016: Extracting Large Parallel Corpora from Movie and TV Subtitles

Abstract

We present a new major release of the OpenSubtitles collection of parallel corpora. The release is compiled from a large database of movie and TV subtitles and includes a total of 1689 bitexts spanning 2.6 billion sentences across 60 languages. The release also incorporates a number of enhancements in the preprocessing and alignment of the subtitles, such as the automatic correction of OCR errors and the use of meta-data to estimate the quality of each subtitle and score subtitle pairs.

Date

2016-05

Proceedings Title

Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

Place

Portorož, Slovenia

Publisher

European Language Resources Association (ELRA)

Pages

923–929

URL

https://aclanthology.org/L16-1147

Reference

Lison, P., & Tiedemann, J. (2016). OpenSubtitles2016: Extracting Large Parallel Corpora from Movie and TV Subtitles. In N. Calzolari, K. Choukri, T. Declerck, S. Goggi, M. Grobelnik, B. Maegaard, J. Mariani, H. Mazo, A. Moreno, J. Odijk, & S. Piperidis (Eds.), Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16) (pp. 923–929). European Language Resources Association (ELRA). https://aclanthology.org/L16-1147

Corpus

Text
- Web

Language

Multilingual

Link to this record

https://colaf.huma-num.fr/bibliography/232D4EKG