Resource | Name | Category | Summary |
SLR1 | Yesno
| Speech
| Sixty recordings of one individual saying yes or no in Hebrew; each recording is eight words long.
|
SLR2 | OpenFST
| Software
| A mirror of the OpenFst toolkit
|
SLR3 | sph2pipe
| Software
| A mirror of the sph2pipe software
|
SLR4 | sctk
| Software
| A mirror of the sctk scoring software
|
SLR5 | MSU Switchboard transcripts
| Text
| A mirror of the Mississippi State transcripts and lexicon for Switchboard.
|
SLR6 | Vystadial
| Speech
| English and Czech data, mirrored from the Vystadial project
|
SLR7 | TED-LIUM
| Speech
| English speech recognition training corpus from TED talks, created by Laboratoire d’Informatique de l’Université du Maine (LIUM) (mirrored here)
|
SLR8 | Sprakbanken
| Text
| Danish pronunciation dictionary generated using eSpeak
|
SLR9 | The AMI pack
| Text
| Some auxiliary non-speech data used to build AMI systems with Kaldi
|
SLR10 | SRE Data
| Misc
| Various files from SRE data that NIST used to host online
|
SLR11 | LibriSpeech language models, vocabulary and G2P models
| Text
| Language modelling resources, for use with the LibriSpeech ASR corpus
|
SLR12 | LibriSpeech ASR corpus
| Speech
| Large-scale (1000 hours) corpus of read English speech
|
SLR13 | RWCP Sound Scene Database
| Speech + Software
| A database of recordings of real-world sounds and measured room impulse responses
|
SLR14 | BEEP Dictionary
| Text
| Phonemic transcriptions of over 250,000 English words (British English pronunciations).
|
SLR15 | SRE Speaker List
| Misc
| A list linking speakers across NIST SRE corpora
|
SLR16 | The AMI Corpus
| Speech
| Acoustic speech data and meta-data from The AMI corpus.
|
SLR17 | MUSAN
| Audio
| A corpus of music, speech, and noise
|
SLR18 | THCHS-30
| Speech
| A Free Chinese Speech Corpus Released by CSLT@Tsinghua University
|
SLR19 | TED-LIUMv2
| Audio
| TED-LIUM corpus release 2, English speech recognition training corpus from TED talks, created by Laboratoire d’Informatique de l’Université du Maine (LIUM) (mirrored here)
|
SLR20 | Aachen Impulse Response Database
| Audio
| Aachen Impulse Response database (AIR): a database of room impulse responses (mirrored here)
|
SLR21 | Spanish Word list
| Text
| A list of words in Spanish with frequency derived from a large corpus (Spanish Gigaword).
|
SLR22 | THUYG-20
| Speech
| A free Uyghur speech database Released by CSLT@Tsinghua University & Xinjiang University
|
SLR23 | NIST LRE 2007 Key
| Misc
| A file containing metadata for the utterances in the LRE 2007 evaluation
|
SLR24 | Iban
| Speech
| Iban language text and speech corpora for ASR
|
SLR25 | ALFFA (African Languages in the Field: speech Fundamentals and Automation)
| Speech
| Amharic, Swahili and Wolof data, mirrored from the ALFFA git repository
|
SLR26 | Simulated Room Impulse Response Database
| Audio
| A database of simulated room impulse responses
|
SLR27 | Cantab-TEDLIUM Release 1.1 (February 2015)
| Text
| Cantab Research Language models for the TEDLIUM database
|
SLR28 | Room Impulse Response and Noise Database
| Audio
| A database of simulated and real room impulse responses, plus isotropic and point-source noises. All audio files use a 16 kHz sampling rate and 16-bit precision.
|
SLR29 | Sprakbanken_Swe
| Text
| Swedish pronunciation dictionary
|
SLR30 | Sinhala TTS
| Speech
| Sinhalese multi-speaker TTS corpora
|
SLR31 | Mini LibriSpeech ASR corpus
| Speech
| Subset of the LibriSpeech corpus for the purpose of regression testing
|
SLR32 | High quality TTS data for four South African languages (af, st, tn, xh)
| Speech
| Multi-speaker TTS data for four South African languages: Afrikaans, Sesotho, Setswana and isiXhosa.
|
SLR33 | Aishell
| Speech
| Mandarin data, provided by Beijing Shell Shell Technology Co., Ltd.
|
SLR34 | Santiago Spanish Lexicon
| Text
| A pronouncing dictionary for the Spanish language.
|
SLR35 | Large Javanese ASR training data set
| Speech
| Javanese ASR training data set containing ~185K utterances.
|
SLR36 | Large Sundanese ASR training data set
| Speech
| Sundanese ASR training data set containing ~220K utterances.
|
SLR37 | High quality TTS data for Bengali languages
| Speech
| Multi-speaker TTS data for Bangladesh Bengali (bn-BD) and Indian Bengali (bn-IN).
|
SLR38 | Free ST Chinese Mandarin Corpus
| Speech
| A free Chinese Mandarin corpus by Surfingtech (www.surfing.ai), containing 102600 utterances from 855 speakers.
|
SLR39 | Heroico
| Speech
| Spanish data, mirrored from the LDC
|
SLR40 | Zeroth-Korean
| Speech Corpus for Automatic Speech Recognition
| Korean Open-source Speech Corpus for Speech Recognition by Zeroth Project (https://github.com/goodatlas/zeroth)
|
SLR41 | High quality TTS data for Javanese.
| Speech
| Multi-speaker TTS data for Javanese (jv-ID)
|
SLR42 | High quality TTS data for Khmer.
| Speech
| Multi-speaker TTS data for Khmer (km-KH)
|
SLR43 | High quality TTS data for Nepali.
| Speech
| Multi-speaker TTS data for Nepali (ne-NP)
|
SLR44 | High quality TTS data for Sundanese.
| Speech
| Multi-speaker TTS data for Sundanese (su-ID)
|
SLR45 | Free ST American English Corpus
| Speech
| A free American English corpus by Surfingtech (www.surfing.ai), containing utterances from 10 speakers; each speaker has about 350 utterances.
|
SLR46 | Tunisian_MSA
| Speech
| Tunisian Modern Standard Arabic
|
SLR47 | Primewords Chinese Corpus Set 1
| Speech
| Chinese Mandarin corpus released by Shanghai Primewords Co. Ltd. (www.primewords.cn), containing 100 hours of speech data.
|
SLR48 | MADCAT Arabic data splits
| Other
| Unofficial data splits (dev/train/test) for the MADCAT Arabic LDC corpus
|
SLR49 | VoxCeleb Data
| Misc
| Various files for the VoxCeleb datasets
|
SLR50 | MADCAT Chinese data splits
| Other
| Unofficial data splits (dev/train/test) for the MADCAT Chinese LDC corpus
|
SLR51 | TED-LIUM Release 3
| Speech
| TED-LIUM corpus release 3
|
SLR52 | Large Sinhala ASR training data set
| Speech
| Sinhala ASR training data set containing ~185K utterances.
|
SLR53 | Large Bengali ASR training data set
| Speech
| Bengali ASR training data set containing ~196K utterances.
|
SLR54 | Large Nepali ASR training data set
| Speech
| Nepali ASR training data set containing ~157K utterances.
|
SLR55 | CLMAD
| Text
| A Chinese Language Model Adaptation Dataset (CLMAD).
|
SLR56 | IAM Aachen splits
| Other
| Aachen data splits (train/test/val) for the IAM dataset.
|
SLR57 | African Accented French
| Speech
| Recordings of African Accented French speech.
|
SLR58 | Pansori-TEDxKR
| Speech
| Korean speech corpus generated from Korean language TEDx talks
|
SLR59 | ParlamentParla
| Speech
| Catalan speech corpus generated from Catalan Parliamentary sessions
|
SLR60 | LibriTTS corpus
| Speech
| Large-scale corpus of English speech derived from the original materials of the LibriSpeech corpus
|
SLR61 | Crowdsourced high-quality Argentinian Spanish speech data set.
| Speech
| Data set which contains 5739 recordings of native speakers of Spanish
|
SLR62 | aidatatang_200zh
| Speech
| A Chinese Mandarin speech corpus by Beijing DataTang Technology Co., Ltd, containing 200 hours of speech data from 600 speakers. The transcription accuracy for each sentence is higher than 98%.
|
SLR63 | Crowdsourced high-quality Malayalam multi-speaker speech data set.
| Speech
| Data set which contains recordings of native speakers of Malayalam.
|
SLR64 | Crowdsourced high-quality Marathi multi-speaker speech data set.
| Speech
| Data set which contains recordings of native speakers of Marathi
|
SLR65 | Crowdsourced high-quality Tamil multi-speaker speech data set.
| Speech
| Data set which contains recordings of native speakers of Tamil.
|
SLR66 | Crowdsourced high-quality Telugu multi-speaker speech data set.
| Speech
| Data set which contains recordings of native speakers of Telugu.
|
SLR67 | TEDx Spanish Corpus
| Speech
| Spanish data taken from the TEDx Talks
|
SLR68 | MAGICDATA Mandarin Chinese Read Speech Corpus
| Speech
| A corpus by Magic Data Technology Co., Ltd., containing 755 hours of scripted read speech data from 1080 native speakers of the Mandarin Chinese spoken in mainland China. The sentence transcription accuracy is higher than 98%.
|
SLR69 | Crowdsourced high-quality Catalan speech data set.
| Speech
| Data set which contains recordings of Catalan.
|
SLR70 | Crowdsourced high-quality Nigerian English speech data set.
| Speech
| Data set which contains recordings of Nigerian English.
|
SLR71 | Crowdsourced high-quality Chilean Spanish speech data set.
| Speech
| Data set which contains recordings of Chilean Spanish.
|
SLR72 | Crowdsourced high-quality Colombian Spanish speech data set.
| Speech
| Data set which contains recordings of Colombian Spanish.
|
SLR73 | Crowdsourced high-quality Peruvian Spanish speech data set.
| Speech
| Data set which contains recordings of Peruvian Spanish.
|
SLR74 | Crowdsourced high-quality Puerto Rico Spanish speech data set.
| Speech
| Data set which contains recordings of Puerto Rico Spanish.
|
SLR75 | Crowdsourced high-quality Venezuelan Spanish speech data set.
| Speech
| Data set which contains recordings of Venezuelan Spanish.
|
SLR76 | Crowdsourced high-quality Basque speech data set.
| Speech
| Data set which contains recordings of Basque.
|
SLR77 | Crowdsourced high-quality Galician speech data set.
| Speech
| Data set which contains recordings of Galician.
|
SLR78 | Crowdsourced high-quality Gujarati multi-speaker speech data set.
| Speech
| Data set which contains recordings of native speakers of Gujarati.
|
SLR79 | Crowdsourced high-quality Kannada multi-speaker speech data set.
| Speech
| Data set which contains recordings of native speakers of Kannada.
|
SLR80 | Crowdsourced high-quality Burmese speech data set.
| Speech
| Data set which contains recordings of Burmese.
|
SLR81 | Small Audio Clips
| Speech
| Contains 20 one-second audio clips from various sources, for testing compression algorithms
|
SLR82 | CN-Celeb
| Speech
| A Free Chinese Speaker Recognition Corpus Released by CSLT@Tsinghua University
|
SLR83 | Crowdsourced high-quality UK and Ireland English Dialect speech data set.
| Speech
| Data set which contains male and female recordings of English from various dialects of the UK and Ireland.
|
SLR84 | ScribbleLens
| Handwriting
| Dutch cursive handwriting from the 16th to 18th centuries, as pages and lines, for (un)supervised AI and other research.
|
SLR85 | HI-MIA
| Speech
| A far-field text-dependent speaker verification database for AISHELL Speaker Verification Challenge 2019
|
SLR86 | Crowdsourced high-quality Yoruba speech data set.
| Speech
| Data set which contains recordings of Yoruba.
|
SLR87 | MobvoiHotwords
| Speech
| Chinese hotwords detection dataset, provided by Mobvoi CO.,LTD
|
SLR88 | Att-HACK
| Speech
| French Expressive Speech Database with Social Attitudes
|
SLR89 | Yoloxóchitl-Mixtec
| Speech
| Yoloxóchitl Mixtec Speech with Transcription
|
SLR92 | Puebla-Nahuatl
| Speech
| Puebla Nahuatl Speech with Transcription
|
SLR93 | AISHELL-3
| Speech
| Mandarin data, provided by Beijing Shell Shell Technology Co., Ltd.
|
SLR94 | Multilingual LibriSpeech (MLS)
| Speech
| A large multilingual corpus derived from LibriVox audiobooks
|
SLR95 | Thorsten Müller (TTS)
| Speech
| Free single-speaker German dataset (> 23 hours) by Thorsten Müller (voice) and Dominik Kreutz (audio optimization) for TTS training
|
SLR96 | Russian LibriSpeech (RuLS)
| Speech
| Russian speech dataset based on LibriVox audiobooks
|