MediaSpeech
Identifier: SLR108
Summary: French, Arabic, Turkish and Spanish media speech datasets
Category: Speech
License: The dataset is distributed under the Creative Commons Attribution 4.0 International License.
Downloads (use a mirror closer to you):
AR.tgz [635M] (Arabic speech and transcripts) Mirrors: [US] [EU] [CN]
ES.tgz [606M] (Spanish speech and transcripts) Mirrors: [US] [EU] [CN]
FR.tgz [661M] (French speech and transcripts) Mirrors: [US] [EU] [CN]
TR.tgz [642M] (Turkish speech and transcripts) Mirrors: [US] [EU] [CN]
About this resource:
The dataset consists of short speech segments automatically extracted from media videos available on YouTube and manually transcribed, with some pre- and post-processing.
Baseline models and a WAV version of the dataset are available in the following Git repository: https://github.com/NTRLab/MediaSpeech
To cite the dataset, please use the following BibTeX entry:
@misc{mediaspeech2021,
  title={MediaSpeech: Multilanguage ASR Benchmark and Dataset},
  author={Rostislav Kolobov and Olga Okhapkina and Olga Omelchishina and Andrey Platunov and Roman Bedyakin and Vyacheslav Moshkin and Dmitry Menshikov and Nikolay Mikhaylovskiy},
  year={2021},
  eprint={2103.16193},
  archivePrefix={arXiv},
  primaryClass={eess.AS}
}
External URLs:
https://github.com/NTRLab/MediaSpeech/releases/download/1.1/AR.zip (Arabic WAV)
https://github.com/NTRLab/MediaSpeech/releases/download/1.1/ES.zip (Spanish WAV)
https://github.com/NTRLab/MediaSpeech/releases/download/1.1/FR.zip (French WAV)
https://github.com/NTRLab/MediaSpeech/releases/download/1.1/TR.zip (Turkish WAV)
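
The WAV archives listed under External URLs share a single URL pattern that differs only in the language code, so fetching them can be scripted. The following is a minimal sketch using only the Python standard library. The helper names `wav_archive_url` and `download_and_extract` and the default target directory `mediaspeech` are hypothetical, chosen for illustration. The sketch makes no assumption about how files are laid out inside each archive.

```python
import urllib.request
import zipfile
from pathlib import Path

# Release URL prefix and language codes taken from the External URLs list.
BASE_URL = "https://github.com/NTRLab/MediaSpeech/releases/download/1.1"
LANGUAGES = ("AR", "ES", "FR", "TR")


def wav_archive_url(lang: str) -> str:
    """Build the download URL of the WAV archive for one language code."""
    code = lang.upper()
    if code not in LANGUAGES:
        raise ValueError(f"unknown language code {lang!r}; expected one of {LANGUAGES}")
    return f"{BASE_URL}/{code}.zip"


def download_and_extract(lang: str, dest: str = "mediaspeech") -> Path:
    """Download one WAV archive (hypothetical helper) and unpack it under dest."""
    dest_dir = Path(dest)
    dest_dir.mkdir(parents=True, exist_ok=True)
    archive = dest_dir / f"{lang.upper()}.zip"
    # Skip the download if the archive is already present locally.
    if not archive.exists():
        urllib.request.urlretrieve(wav_archive_url(lang), str(archive))
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(dest_dir)
    return dest_dir
```

Calling `download_and_extract("TR")` would fetch the Turkish archive into the `mediaspeech` directory and unpack it there. Each archive is several hundred megabytes, so the sketch skips the download when the archive file already exists.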