Open Speech and Language Resources



LibriSpeech language models, vocabulary and G2P models

Identifier: SLR11

Summary: Language modeling resources for use with the LibriSpeech ASR corpus

Category: Text

License: Public domain

Downloads (use a mirror closer to you):
librispeech-lm-corpus.tgz   (14500 public domain books, used as training material for the LibriSpeech LM)   Mirrors: [US]   [EU]   [CN]
librispeech-lm-norm.txt.gz   (Normalized LM training text)   Mirrors: [US]   [EU]   [CN]
librispeech-vocab.txt   (200K word vocabulary for the LM)   Mirrors: [US]   [EU]   [CN]
librispeech-lexicon.txt   (Pronunciations, some of which were auto-generated by G2P, for all words in the vocabulary)   Mirrors: [US]   [EU]   [CN]
3-gram.arpa.gz   (3-gram ARPA LM, not pruned)   Mirrors: [US]   [EU]   [CN]
3-gram.pruned.1e-7.arpa.gz   (3-gram ARPA LM, pruned with threshold 1e-7)   Mirrors: [US]   [EU]   [CN]
3-gram.pruned.3e-7.arpa.gz   (3-gram ARPA LM, pruned with threshold 3e-7)   Mirrors: [US]   [EU]   [CN]
4-gram.arpa.gz   (4-gram ARPA LM, usually used for rescoring)   Mirrors: [US]   [EU]   [CN]
g2p-model-5   (Fifth-order Sequitur G2P model)   Mirrors: [US]   [EU]   [CN]
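
As a quick sanity check after downloading one of the ARPA files listed above, the n-gram counts can be read from the file header without loading the whole model. The following is a minimal sketch, not an official tool: it assumes the standard ARPA layout (a `\data\` marker followed by `ngram N=COUNT` lines) and UTF-8 text. The function names `read_arpa_counts` and `read_arpa_counts_gz` are hypothetical helpers introduced here for illustration.

```python
import gzip
import io
import re

# Matches header lines such as "ngram 3=12345" (standard ARPA layout, assumed).
NGRAM_LINE = re.compile(r"ngram\s+(\d+)\s*=\s*(\d+)$")


def read_arpa_counts(stream):
    """Return {order: count} parsed from the header of an ARPA LM text stream."""
    counts = {}
    in_header = False
    for line in stream:
        line = line.strip()
        if line == "\\data\\":
            in_header = True
            continue
        if not in_header:
            continue
        match = NGRAM_LINE.match(line)
        if match:
            counts[int(match.group(1))] = int(match.group(2))
        elif line:
            # The first non-empty line that is not a count (for example
            # "\1-grams:") marks the end of the header, so stop reading.
            break
    return counts


def read_arpa_counts_gz(path):
    """Hypothetical convenience wrapper for gzip-compressed files such as 3-gram.arpa.gz."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return read_arpa_counts(handle)


# Tiny in-memory example, so the sketch runs without downloading anything.
sample = "\\data\\\nngram 1=3\nngram 2=2\n\n\\1-grams:\n-1.0\t<s>\t-0.5\n"
demo_counts = read_arpa_counts(io.StringIO(sample))
```

With a downloaded file in the working directory, `read_arpa_counts_gz("3-gram.pruned.3e-7.arpa.gz")` would return one entry per n-gram order. Because the function stops at the end of the header, it reads only the first few lines even for the large unpruned models.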

About this resource:

Language modeling resources to be used in conjunction with the (soon-to-be-released) LibriSpeech ASR corpus.

This corpus and these resources were prepared by Vassil Panayotov with the assistance of Daniel Povey and Sanjeev Khudanpur. We hope to finalize this and release the corpus here by the ICASSP deadline (early October 2014).