Open Speech and Language Resources



The MC Speech Dataset

Identifier: SLR142

Summary: Free speech dataset consisting of 24018 short audio clips of a single speaker reading sentences in Polish

Category: Speech

License: CC0 1.0

Downloads (use a mirror closer to you):
mcspeech.tar.gz   Mirrors: [US]   [EU]   [CN]  

About this resource:

This is a public domain speech dataset consisting of 24018 short audio clips of a single speaker reading sentences in Polish. A transcription is provided for each clip. The clips have a total length of more than 22 hours.
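
As a quick sanity check after downloading, the archive contents can be tallied by file extension and compared against the clip count stated above. The following is only a minimal sketch: the function name `summarize_archive` is hypothetical, and nothing is assumed about the archive's internal layout or file formats, which are not described on this page.

```python
import tarfile
from collections import Counter
from pathlib import PurePosixPath


def summarize_archive(archive_path):
    """Count regular files in a .tar.gz archive, grouped by lowercase extension."""
    counts = Counter()
    with tarfile.open(archive_path, "r:gz") as tar:
        for member in tar:
            # Skip directories, links, and other non-regular entries.
            if member.isfile():
                suffix = PurePosixPath(member.name).suffix.lower()
                counts[suffix or "(no extension)"] += 1
    return counts


# Example (assumes mcspeech.tar.gz is already in the current directory):
#     for ext, n in sorted(summarize_archive("mcspeech.tar.gz").items()):
#         print(ext, n)
```

If the audio extension's count matches the 24018 clips mentioned above, the download is likely complete; the archive is only read, not extracted, so this is safe to run before unpacking.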

The texts are in the public domain. The audio was recorded in 2021-22 as part of my master's thesis and is also in the public domain.

If you use this dataset, please cite:
@mastersthesis{mcspeech,
  title={Analiza porównawcza korpusów nagrań mowy dla celów syntezy mowy w języku polskim},
  author={Czyżnikiewicz, Mateusz},
  year={2022},
  month={December},
  school={Warsaw University of Technology},
  type={Master's thesis},
  doi={10.13140/RG.2.2.26293.24800},
  note={Available at \url{http://dx.doi.org/10.13140/RG.2.2.26293.24800}},
}

More information about the dataset is available at https://github.com/czyzi0/the-mc-speech-dataset

External URL: https://github.com/czyzi0/the-mc-speech-dataset