Open Speech and Language Resources



MediaSpeech

Identifier: SLR108

Summary: French, Arabic, Turkish and Spanish media speech datasets

Category: Speech

License: The dataset is distributed under the Creative Commons Attribution 4.0 International License.

Downloads (use a mirror closer to you):
AR.tgz   (Arabic speech and transcripts)    Mirrors: [US] [EU] [CN]
ES.tgz   (Spanish speech and transcripts)   Mirrors: [US] [EU] [CN]
FR.tgz   (French speech and transcripts)    Mirrors: [US] [EU] [CN]
TR.tgz   (Turkish speech and transcripts)   Mirrors: [US] [EU] [CN]
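
Once one of the archives above has been downloaded from a mirror, it can be unpacked with Python's standard library. The following is a minimal sketch, not an official tool: the function name `extract_archive` and the destination directory are illustrative, and because the internal layout of the archives is not described here, the function simply returns whatever files it finds after extraction.

```python
import tarfile
from pathlib import Path


def extract_archive(archive_path, dest_dir):
    """Extract a gzip-compressed tar archive and return the extracted files, sorted.

    `extract_archive` is a hypothetical helper; it makes no assumption about
    how audio and transcript files are arranged inside the archive.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    root = dest.resolve()
    with tarfile.open(archive_path, "r:gz") as tar:
        # Refuse members whose paths would land outside the destination directory.
        for member in tar.getmembers():
            target = (root / member.name).resolve()
            if not target.is_relative_to(root):
                raise ValueError(f"unsafe path in archive: {member.name}")
        tar.extractall(root)
    return sorted(p for p in root.rglob("*") if p.is_file())
```

For example, `extract_archive("FR.tgz", "MediaSpeech/FR")` would unpack the French archive into a local `MediaSpeech/FR` directory (a path chosen here for illustration) and return the list of extracted files. `Path.is_relative_to` requires Python 3.9 or later.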

About this resource:

MediaSpeech is a dataset of French, Arabic, Turkish and Spanish media speech built to test the performance of automatic speech recognition (ASR) systems. It contains 10 hours of speech for each of the four languages.
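
ASR performance on a test set of this kind is commonly reported as word error rate (WER): the word-level edit distance between the reference transcript and the system output, divided by the number of reference words. The sketch below is a plain illustration of that metric, assuming whitespace tokenization and no text normalization. It is not necessarily the scoring procedure used by the dataset authors, and the function name `word_error_rate` is illustrative.

```python
def word_error_rate(reference, hypothesis):
    """Return the word error rate of `hypothesis` against `reference`.

    Both arguments are plain strings; words are split on whitespace.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)
```

In practice, transcripts and system output are usually normalized (case, punctuation, numerals) before scoring, and the normalization choice can change the reported WER noticeably.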

The dataset consists of short speech segments automatically extracted from media videos available on YouTube and manually transcribed, with some pre- and post-processing.

Baseline models and a WAV version of the dataset are available in the following Git repository: https://github.com/NTRLab/MediaSpeech

To cite the dataset, please use the following BibTeX entry:

@misc{mediaspeech2021,
      title={MediaSpeech: Multilanguage ASR Benchmark and Dataset}, 
      author={Rostislav Kolobov and Olga Okhapkina and Olga Omelchishina and Andrey Platunov and Roman Bedyakin and Vyacheslav Moshkin and Dmitry Menshikov and Nikolay Mikhaylovskiy},
      year={2021},
      eprint={2103.16193},
      archivePrefix={arXiv},
      primaryClass={eess.AS}
}

External URLs:
https://github.com/NTRLab/MediaSpeech/releases/download/1.1/AR.zip   (Arabic WAV)
https://github.com/NTRLab/MediaSpeech/releases/download/1.1/ES.zip   (Spanish WAV)
https://github.com/NTRLab/MediaSpeech/releases/download/1.1/FR.zip   (French WAV)
https://github.com/NTRLab/MediaSpeech/releases/download/1.1/TR.zip   (Turkish WAV)