Open Speech and Language Resources



Large Javanese ASR training data set

Identifier: SLR35

Summary: Javanese ASR training data set containing ~185K utterances.

Category: Speech

License: Attribution-ShareAlike 4.0 International

Downloads (use a mirror closer to you):
asr_javanese.sha256 [1.3K]   (Checksum for the files)   Mirrors: [US]   [EU]   [CN]
LICENSE   (License information for the data set)   Mirrors: [US]   [EU]   [CN]
utt_spk_text.tsv   (All utterances in the data set)   Mirrors: [US]   [EU]   [CN]
asr_javanese_0.zip   (Data set, file 0/15)   Mirrors: [US]   [EU]   [CN]
asr_javanese_1.zip   (Data set, file 1/15)   Mirrors: [US]   [EU]   [CN]
asr_javanese_2.zip   (Data set, file 2/15)   Mirrors: [US]   [EU]   [CN]
asr_javanese_3.zip   (Data set, file 3/15)   Mirrors: [US]   [EU]   [CN]
asr_javanese_4.zip   (Data set, file 4/15)   Mirrors: [US]   [EU]   [CN]
asr_javanese_5.zip   (Data set, file 5/15)   Mirrors: [US]   [EU]   [CN]
asr_javanese_6.zip   (Data set, file 6/15)   Mirrors: [US]   [EU]   [CN]
asr_javanese_7.zip   (Data set, file 7/15)   Mirrors: [US]   [EU]   [CN]
asr_javanese_8.zip   (Data set, file 8/15)   Mirrors: [US]   [EU]   [CN]
asr_javanese_9.zip   (Data set, file 9/15)   Mirrors: [US]   [EU]   [CN]
asr_javanese_a.zip   (Data set, file 10/15)   Mirrors: [US]   [EU]   [CN]
asr_javanese_b.zip   (Data set, file 11/15)   Mirrors: [US]   [EU]   [CN]
asr_javanese_c.zip   (Data set, file 12/15)   Mirrors: [US]   [EU]   [CN]
asr_javanese_d.zip   (Data set, file 13/15)   Mirrors: [US]   [EU]   [CN]
asr_javanese_e.zip   (Data set, file 14/15)   Mirrors: [US]   [EU]   [CN]
asr_javanese_f.zip   (Data set, file 15/15)   Mirrors: [US]   [EU]   [CN]
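
Since a checksum file is provided alongside the archives, downloads can be checked before unpacking. The following is a minimal sketch, not an official tool. It assumes that asr_javanese.sha256 uses the common `sha256sum` layout (one hex digest followed by a file name per line), which this page does not confirm. The function names `sha256_of`, `parse_checksum_file`, and `verify_downloads` are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def parse_checksum_file(path):
    """Map file name -> expected digest.

    ASSUMPTION: each line looks like "<hex digest>  <file name>",
    as written by the sha256sum utility. Adjust if the real file differs.
    """
    expected = {}
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        digest, name = parts[0], parts[-1].lstrip("*")
        expected[name] = digest.lower()
    return expected


def verify_downloads(checksum_path, download_dir="."):
    """Return {file name: "ok" | "mismatch" | "missing"} for every listed file."""
    results = {}
    for name, digest in parse_checksum_file(checksum_path).items():
        target = Path(download_dir) / name
        if not target.exists():
            results[name] = "missing"
        elif sha256_of(target) == digest:
            results[name] = "ok"
        else:
            results[name] = "mismatch"
    return results
```

Calling `verify_downloads("asr_javanese.sha256", ".")` in the download directory would report each listed file as ok, mismatch, or missing, so a partial download of the 16 archives is easy to spot.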

About this resource:

This data set contains transcribed audio data for Javanese. It consists of wave files and a TSV file. The file utt_spk_text.tsv lists, for each utterance, a FileID, a UserID, and the transcription of the audio in that file.
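
The description of utt_spk_text.tsv above can be turned into a small reader. This is a sketch under assumptions that the page does not state explicitly: the file is UTF-8, has no header row, and has three tab-separated columns in the order FileID, UserID, transcription. The names `Utterance`, `read_utterances`, and `utterances_per_speaker` are hypothetical.

```python
import csv
from collections import Counter
from typing import Iterator, NamedTuple


class Utterance(NamedTuple):
    file_id: str  # FileID column
    user_id: str  # UserID column
    text: str     # transcription column


def read_utterances(tsv_path) -> Iterator[Utterance]:
    """Yield one Utterance per row of the TSV file.

    ASSUMPTION: columns are FileID, UserID, transcription, in that order,
    separated by tabs, with no header row.
    """
    with open(tsv_path, encoding="utf-8", newline="") as handle:
        # QUOTE_NONE keeps any quote characters in transcriptions literal.
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            if len(row) < 3:
                continue  # skip blank or incomplete rows
            # Rejoin in case a transcription itself contains a tab.
            yield Utterance(row[0], row[1], "\t".join(row[2:]))


def utterances_per_speaker(tsv_path) -> Counter:
    """Count how many utterances each UserID contributed."""
    return Counter(utt.user_id for utt in read_utterances(tsv_path))
```

A generator is used here so that the roughly 185K rows need not be held in memory at once; `utterances_per_speaker("utt_spk_text.tsv")` gives a quick view of how utterances are distributed across speakers, which can help when building speaker-disjoint train and test splits.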

The data set has been manually quality checked, but there might still be errors.

This dataset was collected by Google in collaboration with Reykjavik University and Universitas Gadjah Mada in Indonesia.

See the LICENSE.txt file for license information.

Copyright 2016, 2017 Google, Inc.

If you use this data in publications, please cite it as follows:

  @inproceedings{kjartansson-etal-sltu2018,
    title = {{Crowd-Sourced Speech Corpora for Javanese, Sundanese, Sinhala, Nepali, and Bangladeshi Bengali}},
    author = {Oddur Kjartansson and Supheakmungkol Sarin and Knot Pipatsrisawat and Martin Jansche and Linne Ha},
    booktitle = {Proc. The 6th Intl. Workshop on Spoken Language Technologies for Under-Resourced Languages (SLTU)},
    year  = {2018},
    address = {Gurugram, India},
    month = aug,
    pages = {52--55},
    URL   = {http://dx.doi.org/10.21437/SLTU.2018-11},
  }