openslr.org

Open Speech and Language Resources

Contact

jtrmal@gmail.com
(Jan "Yenda" Trmal)

Resources

Resource	Name	Category	Summary
SLR1	Yesno	Speech	Sixty recordings of one individual saying yes or no in Hebrew; each recording is eight words long.
SLR2	OpenFST	Software	A mirror of the OpenFst toolkit
SLR3	sph2pipe	Software	A mirror of the sph2pipe software
SLR4	sctk	Software	A mirror of the sctk scoring software
SLR5	MSU Switchboard transcripts	Text	A mirror of the Mississippi State transcripts and lexicon for Switchboard.
SLR6	Vystadial	Speech	English and Czech data, mirrored from the Vystadial project
SLR8	Sprakbanken	Text	Danish pronunciation dictionary generated using eSpeak
SLR9	The AMI pack	Text	Some auxiliary non-speech data used to build AMI systems with Kaldi
SLR10	SRE Data	Misc	Various files from SRE data that NIST used to host online
SLR11	LibriSpeech language models, vocabulary and G2P models	Text	Language modelling resources, for use with the LibriSpeech ASR corpus
SLR12	LibriSpeech ASR corpus	Speech	Large-scale (1000 hours) corpus of read English speech
SLR13	RWCP Sound Scene Database	Speech + Software	A database of recordings of real-world sounds and measured room impulse responses
SLR14	BEEP Dictionary	Text	Phonemic transcriptions of over 250,000 English words. (British English pronunciations)
SLR15	SRE Speaker List	Misc	A list linking speakers across NIST SRE corpora
SLR16	The AMI Corpus	Speech	Acoustic speech data and meta-data from The AMI corpus.
SLR17	MUSAN	Audio	A corpus of music, speech, and noise
SLR18	THCHS-30	Speech	A Free Chinese Speech Corpus Released by CSLT@Tsinghua University
SLR20	Aachen Impulse Response Database	Audio	Aachen Impulse Response database (AIR): a database of room impulse responses (mirrored here)
SLR21	Spanish Word list	Text	A list of words in Spanish with frequency derived from a large corpus (Spanish Gigaword).
SLR22	THUYG-20	Speech	A free Uyghur speech database Released by CSLT@Tsinghua University & Xinjiang University
SLR23	NIST LRE 2007 Key	Misc	A file containing metadata for the utterances in the LRE 2007 evaluation
SLR24	Iban	Speech	Iban language text and speech corpora for ASR
SLR25	ALFFA (African Languages in the Field: speech Fundamentals and Automation)	Speech	Amharic, Swahili and Wolof data, mirrored from the ALFFA git repository
SLR26	Simulated Room Impulse Response Database	Audio	A database of simulated room impulse responses
SLR27	Cantab-TEDLIUM Release 1.1 (February 2015)	Text	Cantab Research Language models for the TEDLIUM database
SLR28	Room Impulse Response and Noise Database	Audio	A database of simulated and real room impulse responses, isotropic and point-source noises. All audio files are sampled at 16 kHz with 16-bit precision.
SLR29	Sprakbanken_Swe	Text	Swedish pronunciation dictionary
SLR30	Sinhala TTS	Speech	Sinhalese multi-speaker TTS corpora
SLR31	Mini LibriSpeech ASR corpus	Speech	Subset of LibriSpeech corpus for purpose of regression testing
SLR32	High quality TTS data for four South African languages (af, st, tn, xh)	Speech	Multi-speaker TTS data for four South African languages, Afrikaans, Sesotho, Setswana and isiXhosa.
SLR33	Aishell	Speech	Mandarin data, provided by Beijing Shell Shell Technology Co.,Ltd
SLR34	Santiago Spanish Lexicon	Text	A pronouncing dictionary for the Spanish language.
SLR35	Large Javanese ASR training data set	Speech	Javanese ASR training data set containing ~185K utterances.
SLR36	Large Sundanese ASR training data set	Speech	Sundanese ASR training data set containing ~220K utterances.
SLR37	High quality TTS data for Bengali languages	Speech	Multi-speaker TTS data for Bangladesh Bengali (bn-BD) and Indian Bengali (bn-IN).
SLR38	Free ST Chinese Mandarin Corpus	Speech	A free Chinese Mandarin corpus by Surfingtech (www.surfing.ai), containing 102600 utterances from 855 speakers.
SLR39	Heroico	Speech	Spanish data, mirrored from the LDC
SLR40	Zeroth-Korean	Speech Corpus for Automatic Speech Recognition	Korean Open-source Speech Corpus for Speech Recognition by Zeroth Project (https://github.com/goodatlas/zeroth)
SLR41	High quality TTS data for Javanese.	Speech	Multi-speaker TTS data for Javanese (jv-ID)
SLR42	High quality TTS data for Khmer.	Speech	Multi-speaker TTS data for Khmer (km-KH)
SLR43	High quality TTS data for Nepali.	Speech	Multi-speaker TTS data for Nepali (ne-NP)
SLR44	High quality TTS data for Sundanese.	Speech	Multi-speaker TTS data for Sundanese (su-ID)
SLR45	Free ST American English Corpus	Speech	A free American English corpus by Surfingtech (www.surfing.ai), containing utterances from 10 speakers; each speaker has about 350 utterances.
SLR46	Tunisian_MSA	Speech	Tunisian Modern Standard Arabic
SLR47	Primewords Chinese Corpus Set 1	Speech	Chinese Mandarin corpus released by Shanghai Primewords Co. Ltd. (www.primewords.cn), containing 100 hours of speech data.
SLR48	MADCAT Arabic data splits	Other	Unofficial data splits (dev/train/test) for the MADCAT Arabic LDC corpus
SLR49	VoxCeleb Data	Misc	Various files for the VoxCeleb datasets
SLR50	MADCAT Chinese data splits	Other	Unofficial data splits (dev/train/test) for the MADCAT Chinese LDC corpus
SLR52	Large Sinhala ASR training data set	Speech	Sinhala ASR training data set containing ~185K utterances.
SLR53	Large Bengali ASR training data set	Speech	Bengali ASR training data set containing ~196K utterances.
SLR54	Large Nepali ASR training data set	Speech	Nepali ASR training data set containing ~157K utterances.
SLR55	CLMAD	Text	A Chinese Language Model Adaptation Dataset (CLMAD).
SLR56	IAM Aachen splits	Other	Aachen data splits (train/test/val) for the IAM dataset.
SLR57	African Accented French	Speech	Recordings of African Accented French speech.
SLR58	Pansori-TEDxKR	Speech	Korean speech corpus generated from Korean language TEDx talks
SLR59	ParlamentParla	Speech	Catalan speech corpus generated from Catalan Parliamentary sessions
SLR60	LibriTTS corpus	Speech	Large-scale corpus of English speech derived from the original materials of the LibriSpeech corpus
SLR61	Crowdsourced high-quality Argentinian Spanish speech data set.	Speech	Data set which contains 5739 recordings of native speakers of Spanish
SLR62	aidatatang_200zh	Speech	A Chinese Mandarin speech corpus by Beijing DataTang Technology Co., Ltd, containing 200 hours of speech data from 600 speakers. The transcription accuracy for each sentence is higher than 98%.
SLR63	Crowdsourced high-quality Malayalam multi-speaker speech data set.	Speech	Data set which contains recordings of native speakers of Malayalam.
SLR64	Crowdsourced high-quality Marathi multi-speaker speech data set.	Speech	Data set which contains recordings of native speakers of Marathi
SLR65	Crowdsourced high-quality Tamil multi-speaker speech data set.	Speech	Data set which contains recordings of native speakers of Tamil.
SLR66	Crowdsourced high-quality Telugu multi-speaker speech data set.	Speech	Data set which contains recordings of native speakers of Telugu.
SLR67	TEDx Spanish Corpus	Speech	Spanish data taken from the TEDx Talks
SLR68	MAGICDATA Mandarin Chinese Read Speech Corpus	Speech	The corpus by Magic Data Technology Co., Ltd., containing 755 hours of scripted read speech data from 1080 native speakers of the Mandarin Chinese spoken in mainland China. The sentence transcription accuracy is higher than 98%.
SLR69	Crowdsourced high-quality Catalan speech data set.	Speech	Data set which contains recordings of Catalan.
SLR70	Crowdsourced high-quality Nigerian English speech data set.	Speech	Data set which contains recordings of Nigerian English.
SLR71	Crowdsourced high-quality Chilean Spanish speech data set.	Speech	Data set which contains recordings of Chilean Spanish.
SLR72	Crowdsourced high-quality Colombian Spanish speech data set.	Speech	Data set which contains recordings of Colombian Spanish.
SLR73	Crowdsourced high-quality Peruvian Spanish speech data set.	Speech	Data set which contains recordings of Peruvian Spanish.
SLR74	Crowdsourced high-quality Puerto Rico Spanish speech data set.	Speech	Data set which contains recordings of Puerto Rico Spanish.
SLR75	Crowdsourced high-quality Venezuelan Spanish speech data set.	Speech	Data set which contains recordings of Venezuelan Spanish.
SLR76	Crowdsourced high-quality Basque speech data set.	Speech	Data set which contains recordings of Basque.
SLR77	Crowdsourced high-quality Galician speech data set.	Speech	Data set which contains recordings of Galician.
SLR78	Crowdsourced high-quality Gujarati multi-speaker speech data set.	Speech	Data set which contains recordings of native speakers of Gujarati.
SLR79	Crowdsourced high-quality Kannada multi-speaker speech data set.	Speech	Data set which contains recordings of native speakers of Kannada.
SLR80	Crowdsourced high-quality Burmese speech data set.	Speech	Data set which contains recordings of Burmese.
SLR81	Small Audio Clips	Speech	Contains 20 one-second audio clips from various sources, for testing compression algorithms
SLR82	CN-Celeb	Speech	A Free Chinese Speaker Recognition Corpus Released by CSLT@Tsinghua University
SLR83	Crowdsourced high-quality UK and Ireland English Dialect speech data set.	Speech	Data set which contains male and female recordings of English from various dialects of the UK and Ireland.
SLR84	ScribbleLens	Handwriting	Dutch cursive handwriting from the 16th-18th centuries, as pages and lines, for (un)supervised AI and other research.
SLR85	HI-MIA	Speech	A far-field text-dependent speaker verification database for AISHELL Speaker Verification Challenge 2019
SLR86	Crowdsourced high-quality Yoruba speech data set.	Speech	Data set which contains recordings of Yoruba.
SLR87	MobvoiHotwords	Speech	Chinese hotwords detection dataset, provided by Mobvoi CO.,LTD
SLR88	Att-HACK	Speech	French Expressive Speech Database with Social Attitudes
SLR89	Yoloxóchitl-Mixtec	Speech	Yoloxóchitl Mixtec Speech with Transcription
SLR92	Puebla-Nahuatl	Speech	Puebla Nahuatl Speech with Transcription
SLR93	AISHELL-3	Speech	Mandarin data, provided by Beijing Shell Shell Technology Co., Ltd.
SLR94	Multilingual LibriSpeech (MLS)	Speech	A large multilingual corpus derived from LibriVox audiobooks
SLR95	Thorsten Müller (German Neutral-TTS dataset)	Speech	Free single-speaker German dataset (> 23 hours) by Thorsten Müller (voice) and Dominik Kreutz (audio optimization) for TTS training
SLR96	Russian LibriSpeech (RuLS)	Speech	This dataset is based on LibriVox audiobooks
SLR97	Deeply Korean read speech corpus	Speech	Pairs of Korean reading the scripts with 3 text sentiments using 3 vocal sentiments. Recorded in 3 types of places, at 3 distinct distances, with 2 types of smartphone.
SLR98	Deeply parent-child vocal interaction dataset	Speech	The interaction of pairs of parent and child (reading fairy tales, singing children’s songs, conversing, and others). Recorded in 3 types of places, at 3 distinct distances, with 2 types of smartphone.
SLR99	Deeply Nonverbal Vocalization Dataset	Audio	A human nonverbal vocal sound dataset by Deeply Inc.
SLR100	Multilingual TEDx	Speech	A multilingual corpus of TEDx talks for speech recognition and translation
SLR101	speechocean762	Speech	Pronunciation scoring dataset, labeled independently by five human experts
SLR102	Kazakh Speech Corpus (KSC)	Speech	A crowdsourced open-source Kazakh speech corpus developed by ISSAI (330 hours)
SLR103	Multilingual and code-switching ASR Challenge Dataset - sub-task1	Speech	Datasets for sub-task1 in Multilingual and code-switching ASR challenges for low resource Indian languages - MUCS 2021 (https://navana-tech.github.io/MUCS2021/)
SLR104	Multilingual and code-switching ASR Challenge Dataset - sub-task2	Speech	Datasets for sub-task2 in Multilingual and code-switching ASR challenges for low resource Indian languages - MUCS 2021 (https://navana-tech.github.io/MUCS2021/)
SLR105	nicolingua-0003-west-african-radio-corpus	Speech	West African Radio Corpus
SLR106	nicolingua-0004-west-african-va-asr-corpus	Speech	West African Virtual Assistant Speech Recognition Corpus
SLR107	Totonac Resources	Speech	Totonac Speech with Transcription
SLR108	MediaSpeech	Speech	French, Arabic, Turkish and Spanish media speech datasets
SLR109	Hi-Fi Multi-Speaker English TTS Dataset (Hi-Fi TTS)	Speech	A multi-speaker English dataset for training text-to-speech models
SLR110	Thorsten Müller (German Emotional-TTS dataset)	Speech	Free emotional single-speaker German dataset (Neutral, Disgusted, Angry, Amused, Surprised, Sleepy, Drunk, Whispering) by Thorsten Müller (voice) and Dominik Kreutz (audio optimization) for TTS training
SLR111	AISHELL-4	Speech	A Free Mandarin Multi-channel Meeting Speech Corpus, provided by Beijing Shell Shell Technology Co.,Ltd
SLR112	Samromur 21.05	Speech	Samrómur Icelandic Speech corpus approved for release in May 2021
SLR113	SEOUL CORPUS	Speech	The Korean Corpus of Spontaneous Speech (aka, Seoul Corpus), created from the NRF(Korea)-funded project
SLR114	Golos	Speech	Russian ASR dataset (1240 hours) with trained acoustic and language models
SLR115	EmoV_DB	Speech	A database of emotional speech intended to be open-sourced and used for synthesis and generation purposes. It contains data for male and female actors in English (https://github.com/numediart/EmoV-DB)
SLR116	Samrómur Queries 21.12	Speech	Samrómur Icelandic Speech corpus focused on queries and approved for release in December 2021
SLR117	Samrómur Children 21.09	Speech	Samrómur Icelandic Speech from children (ages 4-17 years) approved for release in September 2021
SLR118	1111 Hours Hindi ASR Challenge	Speech	Datasets for 1111 Hours Hindi ASR Challenge Closed, Self Supervised Closed and Open - 2022 (https://sites.google.com/view/gramvaaniasrchallenge/home)
SLR119	AliMeeting	Speech	A Free Mandarin Multi-channel Meeting Speech Corpus, provided by Alibaba Group
SLR120	HI-MIA-CW	Speech	A Free Mandarin Supplemental Speech Corpus to HI-MIA Database, whose contents are negative samples for wake-up words "Hi, Mia".
SLR121	WenetSpeech	Speech	A 10000+ Hours Multi-Domain Mandarin Corpus for Speech Recognition
SLR122	Kashmiri Data Corpus	Speech	An audio and text corpus for the Kashmiri language
SLR123	MAGICDATA Mandarin Chinese Conversational Speech Corpus	Speech	The corpus by Magic Data Technology Co., Ltd., containing 180 hours of richly annotated Mandarin spontaneous conversational speech data.
SLR124	TIBMD@MUC speech data set	Speech	Tibetan multi-dialect speech data (84.33 hours)
SLR125	Basic LAnguage Resource Kit 1.0 for Faroese	Speech	Faroese Speech corpus approved for release in July 2022
SLR126	IISc-MILE Kannada ASR Corpus	Speech	Kannada transcribed speech corpus for ASR
SLR127	IISc-MILE Tamil ASR Corpus	Speech	Tamil transcribed speech corpus for ASR
SLR128	Samrómur Unverified 22.07	Speech	Samrómur Icelandic Speech, 2,200 hours of mostly unverified data approved for release in July 2022
SLR129	BibleTTS	Speech	A large, high-fidelity, multilingual, and uniquely African speech corpus
SLR130	Samrómur L2 22.09	Speech	Samrómur Icelandic Speech, 150 hours from people with Icelandic as a second language. Approved for release in July 2022
SLR131	Samrómur Mimic 22.09	Speech	Samrómur Icelandic Speech, 66.7 hours of speech where users mimic utterances. Approved in September 2022
SLR132	Mohammed	Speech	Arabic speech to text Quran data
SLR133	XBMU-AMDO31	Speech	Tibetan Amdo dialect speech data from NLIT, Northwest Minzu University
SLR134	SASPEECH	Speech	Hebrew speech and transcripts by a single speaker (30 hours)
SLR135	Libri-Mixed-Speakers	Speech	English audio of simultaneous speakers derived from LibriTTS
SLR136	EMNS	Speech, text-to-speech, automatic speech recognition	An emotive single-speaker dataset for narrative storytelling. EMNS is a dataset containing transcriptions, emotion, emotion intensity, and description of acted speech.
SLR137	Silbo Gomero Speech Corpus	Speech	Corpus of the Silbo Gomero whistled language, based on 49 minutes of recordings created by 4 whistlers.
SLR138	SHALCAS22A	Speech	A Chinese Mandarin corpus by Shanghai Acoustics Laboratory, CAS and Wuxi Sandu Intelligent Technology Co., Ltd.
SLR139	Audiocite.net	Speech	Spoken dataset of books read in French, initially collected from audiocite.net by the GETALP team for the LeBenchmark project.
SLR140	Kazakh Speech Dataset (KSD)	Speech	High-quality open source Kazakh speech corpus developed by the Department of Artificial Intelligence and Big Data of Al-Farabi Kazakh National University (554 hours)
SLR141	LibriTTS-R	Speech	Sound quality improved version of the LibriTTS corpus which is a large-scale corpus of English speech designed for TTS use
SLR142	The MC Speech Dataset	Speech	Free speech dataset consisting of 24018 short audio clips of a single speaker reading sentences in Polish
SLR143	Nepali Text-to-Speech Data (Male and Female)	Speech	Nepali speech and corresponding text data in male and female voice
SLR144	SlideSpeech	Audio-Visual Speech	A Large-scale English Multi-Modal Audio-Visual Corpus, provided by Alibaba Group
SLR145	LibriSpeech-PC	Text	LibriSpeech text with Punctuation and Capitalization
SLR146	CML-TTS Dataset	Speech	CML-TTS: A Multilingual Dataset for Speech Synthesis in Low-Resource Languages
SLR147	Veracruz Orizaba Nahuatl Endangered Language	Speech	Audio corpus of Orizaba (Veracruz) Nahuatl speech (Glottocode: oriz1235; ISO 639-3: nlv)
SLR148	Tepetzintla Zacatlan Nahuatl Endangered Language	Speech	Audio corpus of Zacatlán-Ahuacatlán-Tepetzintla (Puebla) Nahuatl speech (Glottocode: zaca1241; ISO 639-3: nhi)
SLR149	Tibetan Greetings	Speech	Selected Tibetan greetings speech data categorized according to the dialectal region.
SLR150	CHiME-6	Speech	English multi-channel far field meeting data used in the CHiME-6 Challenge. It is derived from CHiME-5 by fixing some array synchronization errors.
SLR151	Kallaama	Speech	Wolof, Pulaar and Sereer data
SLR152	Pragmatic Similarity Judgments	Speech	Judgments of perceived similarity between utterance pairs from dialogs, in English and Spanish.
SLR153	Yerevan City Magazine	Text	A Free Armenian News Text Corpus, provided by Qaghaki Amsagir LLC (Yerevan City Magazine, evnmag.com)
SLR154	ArmenianGrqaserAudioBooks	Speech	Cut, segmented, and processed (speech, text) paired data, derived from the Grqaser.org audiobooks
SLR155	SBCSAE	Speech	The Santa Barbara Corpus of Spoken American English, mirrored from UCSB
SLR156	SMIIP-TV dataset	Speech	A short-term time-varying speaker verification dataset
SLR157	Sagalee	Speech	Automatic Speech Recognition Dataset for Oromo Language
SLR158	NICT-Tib1	Speech	33.5-hour Lhasa-Tibetan read-speech corpus with Kaldi-style transcripts
SLR159	AISHELL-5	Speech	The First Open-Source In-Car Multi-Channel Multi-Speaker Speech Dataset for Automatic Speech Diarization and Recognition, provided by Beijing AISHELL Technology Co.,Ltd.
SLR160	Armenian Speech Crowdsourcing Data	Speech	70 hours of Armenian speech collected via crowdsourcing with Toloka and texts from Yerevan City Magazine.
SLR161	Emozionalmente	Speech	A crowdsourced Italian emotional speech corpus