Open Speech and Language Resources

Multilingual and code-switching ASR Challenge Dataset - sub-task2

Identifier: SLR104

Summary: Datasets for sub-task2 of the Multilingual and code-switching ASR challenges for low-resource Indian languages - MUCS 2021 (https://navana-tech.github.io/MUCS2021/)

Category: Speech

License: CC BY-SA 4.0

Downloads (use a mirror closer to you):
Hindi-English_train.tar.gz [7.3G] (Hindi-English code-switched train speech and transcripts) Mirrors: [US] [EU] [CN]
Hindi-English_test.tar.gz [443M] (Hindi-English code-switched test speech and transcripts) Mirrors: [US] [EU] [CN]
Bengali-English_train.tar.gz [3.9G] (Bengali-English code-switched train speech and transcripts) Mirrors: [US] [EU] [CN]
Bengali-English_test.tar.gz [606M] (Bengali-English code-switched test speech and transcripts) Mirrors: [US] [EU] [CN]
subtask2_blindtest_wReadme.tar.gz [1.0G] (sub-task2 blind test set and transcripts) Mirrors: [US] [EU] [CN]

About this resource:

Summary of Hindi-English and Bengali-English Data

The Hindi-English and Bengali-English datasets are extracted from spoken tutorials. These tutorials cover a range of technical topics, and the code-switching arises predominantly from the technical content of the lectures. The segments file in the baseline recipe provides sentence time-stamps. These time-stamps were used to cut segments from each audio file so that they align with the transcripts given in the text file. The Hindi-English train and test datasets contain 89.86 hours and 5.18 hours of speech, respectively, while the Bengali-English train and test datasets contain 46.11 hours and 7.02 hours, respectively. All audio files in both datasets are sampled at 16 kHz with 16-bit encoding. The vocabulary sizes for Hindi-English and Bengali-English are 17877 and 13656, respectively.
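The use of time-stamps to cut segments from the audio can be illustrated with a minimal Python sketch. It assumes the segments file follows the common Kaldi layout of one "<utt-id> <rec-id> <start> <end>" entry per line, with times in seconds; that layout is an assumption and should be checked against the baseline recipe. It also assumes the audio is in WAV format. The names Segment, parse_segments, frame_range and read_segment are hypothetical helpers and are not part of the dataset or the recipe.

```python
import wave
from typing import Iterable, NamedTuple

SAMPLE_RATE = 16000  # sampling rate stated in the dataset description


class Segment(NamedTuple):
    utt_id: str
    rec_id: str
    start: float  # seconds
    end: float    # seconds


def parse_segments(lines: Iterable[str]) -> list[Segment]:
    """Parse lines assumed to look like '<utt-id> <rec-id> <start> <end>'."""
    segments = []
    for line in lines:
        parts = line.split()
        if len(parts) != 4:
            continue  # skip blank or malformed lines
        utt_id, rec_id, start, end = parts
        segments.append(Segment(utt_id, rec_id, float(start), float(end)))
    return segments


def frame_range(seg: Segment, sample_rate: int = SAMPLE_RATE) -> tuple[int, int]:
    """Convert a segment's start and end times to sample indices."""
    return round(seg.start * sample_rate), round(seg.end * sample_rate)


def read_segment(wav_path: str, seg: Segment) -> bytes:
    """Return the raw PCM bytes of one segment from a WAV file."""
    with wave.open(wav_path, "rb") as wav:
        first, last = frame_range(seg, wav.getframerate())
        wav.setpos(first)
        return wav.readframes(last - first)
```

For example, a segment running from 0.50 s to 2.25 s covers samples 8000 to 36000 at 16 kHz, which is 56000 bytes of 16-bit mono audio.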

Summary of the blind test data

In addition to the train and test sets, the blind test set for sub-task2 is provided.