LibriSpeech ASR corpus
Identifier: SLR12
Summary: Large-scale (1000 hours) corpus of read English speech
Category: Speech
License: CC BY 4.0
Downloads (use a mirror closer to you):
dev-clean.tar.gz  [337M]  (development set, "clean" speech)  Mirrors: [US] [EU] [CN]
dev-other.tar.gz  [314M]  (development set, "other", more challenging, speech)  Mirrors: [US] [EU] [CN]
test-clean.tar.gz  [346M]  (test set, "clean" speech)  Mirrors: [US] [EU] [CN]
test-other.tar.gz  [328M]  (test set, "other" speech)  Mirrors: [US] [EU] [CN]
train-clean-100.tar.gz  [6.3G]  (training set of 100 hours of "clean" speech)  Mirrors: [US] [EU] [CN]
train-clean-360.tar.gz  [23G]  (training set of 360 hours of "clean" speech)  Mirrors: [US] [EU] [CN]
train-other-500.tar.gz  [30G]  (training set of 500 hours of "other" speech)  Mirrors: [US] [EU] [CN]
intro-disclaimers.tar.gz  [695M]  (extracted LibriVox announcements for some of the speakers)  Mirrors: [US] [EU] [CN]
original-mp3.tar.gz  [87G]  (LibriVox mp3 files from which the corpus audio was extracted)  Mirrors: [US] [EU] [CN]
original-books.tar.gz  [297M]  (Project Gutenberg texts against which the audio in the corpus was aligned)  Mirrors: [US] [EU] [CN]
raw-metadata.tar.gz  [33M]  (some extra metadata produced during the creation of the corpus)  Mirrors: [US] [EU] [CN]
md5sum.txt  [600 bytes]  (MD5 checksums for the archive files)  Mirrors: [US] [EU] [CN]
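Because the archives are large, it can be worth checking each download against md5sum.txt before extracting it. The following is a minimal sketch, not an official tool. It assumes that md5sum.txt uses the usual `md5sum` layout of one `<hex digest>  <filename>` pair per line, which this page does not confirm, and the function names are hypothetical.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def parse_md5sum_lines(text):
    """Parse md5sum-style text into {filename: lowercase hex digest}.

    Assumption: each non-empty line is '<hex digest> <filename>', with an
    optional '*' before the filename (md5sum's binary-mode marker).
    """
    expected = {}
    for line in text.splitlines():
        parts = line.split(maxsplit=1)
        if len(parts) == 2:
            digest, name = parts
            expected[name.strip().lstrip("*")] = digest.lower()
    return expected


def verify_archives(checksum_file, directory="."):
    """Check each listed archive found in `directory`.

    Returns {filename: True/False}; archives that were not downloaded
    are skipped rather than reported as failures.
    """
    expected = parse_md5sum_lines(Path(checksum_file).read_text())
    results = {}
    for name, digest in expected.items():
        archive = Path(directory) / name
        if archive.exists():
            results[name] = md5_of_file(archive) == digest
    return results
```

For example, `verify_archives("md5sum.txt", ".")` run in the download directory returns a dictionary such as `{"dev-clean.tar.gz": True}` when only that archive is present and its checksum matches. A `False` entry suggests a truncated or corrupted download that should be fetched again, possibly from another mirror.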
About this resource:
Acoustic models trained on this data set are available at kaldi-asr.org, and language models suitable for evaluation can be found at http://www.openslr.org/11/.
For more information, see the paper "LibriSpeech: an ASR corpus based on public domain audio books", Vassil Panayotov, Guoguo Chen, Daniel Povey and Sanjeev Khudanpur, ICASSP 2015 (submitted).