LibriSpeech ASR corpus
Identifier: SLR12
Summary: Large-scale (1000 hours) corpus of read English speech
Category: Speech
License: CC BY 4.0
Downloads (use a mirror closer to you):
dev-clean.tar.gz (development set, "clean" speech) Mirrors: [US] [EU] [CN]
dev-other.tar.gz (development set, "other", more challenging, speech) Mirrors: [US] [EU] [CN]
test-clean.tar.gz (test set, "clean" speech) Mirrors: [US] [EU] [CN]
test-other.tar.gz (test set, "other" speech) Mirrors: [US] [EU] [CN]
train-clean-100.tar.gz (training set of 100 hours "clean" speech) Mirrors: [US] [EU] [CN]
train-clean-360.tar.gz (training set of 360 hours "clean" speech) Mirrors: [US] [EU] [CN]
train-other-500.tar.gz (training set of 500 hours "other" speech) Mirrors: [US] [EU] [CN]
intro-disclaimers.tar.gz (extracted LibriVox announcements for some of the speakers) Mirrors: [US] [EU] [CN]
original-mp3.tar.gz (LibriVox mp3 files, from which the corpus audio was extracted) Mirrors: [US] [EU] [CN]
original-books.tar.gz (Project Gutenberg texts, against which the audio in the corpus was aligned) Mirrors: [US] [EU] [CN]
raw-metadata.tar.gz (some extra metadata produced during the creation of the corpus) Mirrors: [US] [EU] [CN]
md5sum.txt (MD5 checksums for the archive files) Mirrors: [US] [EU] [CN]
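
Since md5sum.txt provides MD5 checksums for the archive files, a downloaded archive can be checked against it before unpacking. The following is a minimal sketch, not an official tool: it assumes md5sum.txt uses the usual md5sum output layout (a hex digest followed by a file name on each line), which this page does not confirm, and the function names are hypothetical.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large archives do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def parse_md5sum_text(text):
    """Parse md5sum-style lines into a {file name: digest} dict.

    Assumption: each line looks like "<hex digest>  <file name>".
    Lines with fewer than two fields are skipped.
    """
    checksums = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 2:
            continue
        digest = parts[0].lower()
        # md5sum marks binary mode with a leading "*" on the file name.
        name = parts[-1].lstrip("*")
        # Key on the base name so entries with a directory prefix still match.
        checksums[Path(name).name] = digest
    return checksums


def verify_archive(archive_path, md5sum_path="md5sum.txt"):
    """Return True if the archive's MD5 matches its entry in the checksum file."""
    expected = parse_md5sum_text(Path(md5sum_path).read_text())
    name = Path(archive_path).name
    if name not in expected:
        raise KeyError(f"{name} is not listed in {md5sum_path}")
    return md5_of_file(archive_path) == expected[name]
```

With an archive and md5sum.txt in the current directory, a call such as verify_archive("dev-clean.tar.gz") returns True when the digests match and False otherwise. A False result most likely indicates an incomplete or corrupted download, in which case fetching the archive again from another mirror is a reasonable next step.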
About this resource:
Acoustic models trained on this data set are available at kaldi-asr.org, and language models suitable for evaluation can be found at http://www.openslr.org/11/.
For more information, see the paper "LibriSpeech: an ASR corpus based on public domain audio books", Vassil Panayotov, Guoguo Chen, Daniel Povey and Sanjeev Khudanpur, ICASSP 2015 (submitted) (pdf)