openslr.org

Open Speech and Language Resources

MediaSpeech

Identifier: SLR108

Summary: French, Arabic, Turkish and Spanish media speech datasets

Category: Speech

License: The dataset is distributed under the Creative Commons Attribution 4.0 International License.

Downloads (use a mirror closer to you):
AR.tgz [635M] (Arabic speech and transcripts) Mirrors: [EU] [EU] [CN]
ES.tgz [606M] (Spanish speech and transcripts) Mirrors: [EU] [EU] [CN]
FR.tgz [661M] (French speech and transcripts) Mirrors: [EU] [EU] [CN]
TR.tgz [642M] (Turkish speech and transcripts) Mirrors: [EU] [EU] [CN]

About this resource:

MediaSpeech is a dataset of French, Arabic, Turkish and Spanish media speech built to test the performance of automatic speech recognition (ASR) systems. It contains 10 hours of speech for each of the four languages.
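
ASR performance on a test set like this is commonly reported as word error rate (WER): the word-level edit distance between the reference transcript and the system output, divided by the number of reference words. The following is a minimal sketch of that metric, not the scoring code used by the dataset authors. The function name `word_error_rate` is hypothetical, and it assumes transcripts are already normalized and whitespace-tokenized.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: both strings are already normalized (case, punctuation)
    and can be tokenized by splitting on whitespace.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return prev[-1] / len(ref)
```

Because insertions count as errors, the value can exceed 1.0 when the hypothesis is much longer than the reference.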

The dataset consists of short speech segments automatically extracted from media videos available on YouTube and manually transcribed, with some pre- and post-processing.

Baseline models and a WAV version of the dataset are available in the following Git repository: https://github.com/NTRLab/MediaSpeech

To cite the dataset, please use the following BibTeX entry:

@misc{mediaspeech2021,
      title={MediaSpeech: Multilanguage ASR Benchmark and Dataset}, 
      author={Rostislav Kolobov and Olga Okhapkina and Olga Omelchishina and Andrey Platunov and Roman Bedyakin and Vyacheslav Moshkin and Dmitry Menshikov and Nikolay Mikhaylovskiy},
      year={2021},
      eprint={2103.16193},
      archivePrefix={arXiv},
      primaryClass={eess.AS}
}

External URLs:
https://github.com/NTRLab/MediaSpeech/releases/download/1.1/AR.zip (Arabic WAV)
https://github.com/NTRLab/MediaSpeech/releases/download/1.1/ES.zip (Spanish WAV)
https://github.com/NTRLab/MediaSpeech/releases/download/1.1/FR.zip (French WAV)
https://github.com/NTRLab/MediaSpeech/releases/download/1.1/TR.zip (Turkish WAV)
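
The four WAV archives listed above share one URL pattern and differ only in the two-letter language code, so fetching them can be scripted. The sketch below uses only the Python standard library. The helper names `wav_archive_url` and `fetch_wav_archive` and the default directory `mediaspeech` are hypothetical, and the code makes no assumption about the file layout inside each archive.

```python
import urllib.request
import zipfile
from pathlib import Path

# Base URL and language codes are taken from the External URLs listed above.
BASE_URL = "https://github.com/NTRLab/MediaSpeech/releases/download/1.1"
LANGUAGES = {"AR": "Arabic", "ES": "Spanish", "FR": "French", "TR": "Turkish"}


def wav_archive_url(lang: str) -> str:
    """Return the WAV archive URL for a two-letter language code."""
    code = lang.upper()
    if code not in LANGUAGES:
        raise ValueError(f"unknown language {lang!r}; expected one of {sorted(LANGUAGES)}")
    return f"{BASE_URL}/{code}.zip"


def fetch_wav_archive(lang: str, dest_dir: str = "mediaspeech") -> Path:
    """Download one archive (unless already present) and unpack it.

    Returns the directory the archive was extracted into. The internal
    layout of the archive is not assumed here.
    """
    url = wav_archive_url(lang)
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    archive = dest / f"{lang.upper()}.zip"
    if not archive.exists():
        urllib.request.urlretrieve(url, str(archive))
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(dest)
    return dest
```

For example, `fetch_wav_archive("TR")` would download and unpack the Turkish archive into `mediaspeech/`, while `wav_archive_url("fr")` only builds the URL without touching the network.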