Open Speech and Language Resources




Multilingual LibriSpeech (MLS)

Identifier: SLR94

Summary: A large multilingual corpus derived from LibriVox audiobooks

Category: Speech

License: CC BY 4.0

About this resource:

NOTE: The data is not hosted on OpenSLR (due to its size) -- please use the links provided below to download it.

The Multilingual LibriSpeech (MLS) dataset is a large multilingual corpus suitable for speech research. It is derived from read audiobooks from LibriVox and covers 8 languages: English, German, Dutch, Spanish, French, Italian, Portuguese and Polish.

ASR Resources

Consists of train, dev and test sets for each language. Also includes small training sets for limited supervision (10 hours, 1 hour or 10 minutes of labelled speech).

Language      Download Link (Original: flac)      Download Link (Compressed: opus)
English *     mls_english.tar.gz (2.4T)           mls_english_opus.tar.gz (651G)
German        mls_german.tar.gz (115G)            mls_german_opus.tar.gz (29G)
Dutch         mls_dutch.tar.gz (86G)              mls_dutch_opus.tar.gz (23G)
French        mls_french.tar.gz (61G)             mls_french_opus.tar.gz (16G)
Spanish       mls_spanish.tar.gz (50G)            mls_spanish_opus.tar.gz (14G)
Italian       mls_italian.tar.gz (15G)            mls_italian_opus.tar.gz (3.8G)
Portuguese    mls_portuguese.tar.gz (9.3G)        mls_portuguese_opus.tar.gz (2.5G)
Polish        mls_polish.tar.gz (6.2G)            mls_polish_opus.tar.gz (1.6G)

LM Resources

Consists of a language modelling corpus and pre-trained 3-gram and 5-gram LMs.

Language      Download Link
English       mls_lm_english.tar.gz (44G)
German        mls_lm_german.tar.gz (2.7G)
Dutch         mls_lm_dutch.tar.gz (1.4G)
French        mls_lm_french.tar.gz (4.8G)
Spanish       mls_lm_spanish.tar.gz (1.2G)
Italian       mls_lm_italian.tar.gz (1.7G)
Portuguese    mls_lm_portuguese.tar.gz (558M)
Polish        mls_lm_polish.tar.gz (30M)

Other Resources

About                                                 Download Link
Downloaded text from LibriVox books                   lv_text.tar.gz (2.0G)
Unrated dev/test transcripts (before human rating)    unrated_transcripts.tar.gz (2.8M)

MD5 Checksums

md5sum.txt
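
As a rough illustration of checking a download against md5sum.txt, the sketch below computes a file's MD5 in chunks and compares it with the listed value. It assumes md5sum.txt follows the usual `md5sum` layout (hex digest, whitespace, file name); that layout is an assumption, not something this page states, and the function names are hypothetical.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def parse_md5sum_lines(lines):
    """Parse "<hex digest>  <file name>" lines into {base name: digest}.

    Assumption: md5sum.txt uses the standard `md5sum` output layout.
    """
    expected = {}
    for line in lines:
        parts = line.split(maxsplit=1)
        if len(parts) == 2:
            digest, name = parts
            # md5sum marks binary mode with a leading "*" on the file name.
            name = name.strip().lstrip("*")
            expected[Path(name).name] = digest.lower()
    return expected


def verify(archive_path, md5sum_path="md5sum.txt"):
    """Return True if the archive's MD5 matches its entry in md5sum.txt."""
    with open(md5sum_path, encoding="utf-8") as handle:
        expected = parse_md5sum_lines(handle)
    name = Path(archive_path).name
    if name not in expected:
        raise KeyError(f"No checksum listed for {name}")
    return md5_of_file(archive_path) == expected[name]
```

For example, `verify("mls_polish.tar.gz")` would return `True` only if the local file matches the listed digest. Reading in chunks keeps memory use flat, which matters for archives of this size.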

All the above links are hosted in an AWS S3 bucket and can also be downloaded using the AWS CLI tools. To do so, create an AWS account and configure the CLI tools with its credentials; all the resources can then be downloaded for free. To get the S3 bucket path of a URL, replace https://dl.fbaipublicfiles.com/ with s3://dl.fbaipublicfiles.com/. For example, the S3 bucket path of URL https://dl.fbaipublicfiles.com/mls/mls_polish.tar.gz is s3://dl.fbaipublicfiles.com/mls/mls_polish.tar.gz.
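
The URL-to-S3 rewrite described above is a plain prefix substitution. A minimal sketch follows; the function name `to_s3_path` is hypothetical, and only the two prefixes come from this page.

```python
HTTPS_PREFIX = "https://dl.fbaipublicfiles.com/"
S3_PREFIX = "s3://dl.fbaipublicfiles.com/"


def to_s3_path(url):
    """Map a download URL from this page to its S3 bucket path."""
    if not url.startswith(HTTPS_PREFIX):
        raise ValueError(f"Expected a URL starting with {HTTPS_PREFIX}")
    # Swap only the leading prefix; the rest of the path is unchanged.
    return S3_PREFIX + url[len(HTTPS_PREFIX):]


print(to_s3_path("https://dl.fbaipublicfiles.com/mls/mls_polish.tar.gz"))
```

The resulting path can then be handed to the AWS CLI, for instance as the source argument of `aws s3 cp`.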

More details on these files and their directory structure can be found in README files included in the .tar.gz files.
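
To read those README files without unpacking a whole archive, something like the following sketch may help. It assumes the README members have base names starting with "README", which this page does not confirm, and `read_readmes` is a hypothetical helper. A gzip-compressed tar has to be scanned sequentially, so this can still be slow on the larger archives.

```python
import tarfile


def read_readmes(archive_path, max_files=1):
    """Return {member name: text} for README-like files in a .tar.gz archive.

    Assumption: README members have a base name starting with "README".
    """
    found = {}
    with tarfile.open(archive_path, mode="r:gz") as archive:
        for member in archive:
            base = member.name.rsplit("/", 1)[-1]
            if member.isfile() and base.upper().startswith("README"):
                data = archive.extractfile(member).read()
                found[member.name] = data.decode("utf-8", errors="replace")
                # Stop early so the rest of the archive is not scanned.
                if len(found) >= max_files:
                    break
    return found
```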

You can cite the data using the following BibTeX entry:

@article{Pratap2020MLSAL,
  title={MLS: A Large-Scale Multilingual Dataset for Speech Research},
  author={Vineel Pratap and Qiantong Xu and Anuroop Sriram and Gabriel Synnaeve and Ronan Collobert},
  journal={ArXiv},
  year={2020},
  volume={abs/2012.03411}
}

NOTE: We made a few updates to the MLS dataset after our INTERSPEECH paper was submitted, to include more hours of data and to improve the quality of the transcripts. To avoid the confusion of having multiple versions, we are making ONLY one release, with all the improvements included. For accurate dataset statistics and baselines, please refer to the arXiv paper: https://arxiv.org/abs/2012.03411.

* The English resource can also be downloaded in parts, as 100GB splits listed in mls_english_parts_list.txt and mls_english_opus_parts_list.txt for the original and compressed versions respectively. After downloading all the splits, run `cat` on the files, in order, to create a single tar file, then verify its md5 checksum.
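
For systems without `cat`, the joining step can be approximated in Python. This is a sketch only: `join_parts` is a hypothetical helper, and the caller is assumed to supply the part files in the correct order (for instance, the order in which they appear in the parts list).

```python
import shutil


def join_parts(part_paths, output_path, buffer_size=16 * 1024 * 1024):
    """Concatenate files in the given order, like `cat p1 p2 ... > output`."""
    with open(output_path, "wb") as out:
        for part in part_paths:
            with open(part, "rb") as src:
                # Stream in fixed-size buffers so memory use stays bounded.
                shutil.copyfileobj(src, out, buffer_size)
    return output_path
```

The joined file should then be checked against the published MD5 value before it is extracted.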