openslr.org

Open Speech and Language Resources

ArmenianGrqaserAudioBooks

Identifier: SLR154

Summary: Cut, segmented, and processed (speech, text) paired data, derived from the Grqaser.org audiobooks

Category: Speech

License: Creative Commons Attribution Share Alike 4.0 (CC BY-SA 4.0)

Downloads (use a mirror closer to you):
grqaser_audiobooks.tgz ( Armenian speech and transcripts ) Mirrors: [US] [EU] [CN]

About this resource:

This dataset is part of our effort to increase the amount of data available for low-resource languages like Armenian and Georgian.
It consists of processed audiobooks. The raw material had a single large transcript and an audio recording tens of minutes long for each chapter.
To make the data ASR/TTS-friendly, we converted the raw corpus into many short audio chunks (typically 3-15 seconds each) with corresponding texts.

We coordinated with the original authors from Grqaser.org, who agreed on the selection of new books we processed.
To make the reconstruction of the books (usually different speakers per book) harder, we encoded the audio file names
and hid the book, chapter, and author information. This is done to prevent voice-cloning attempts in TTS setups (most
of the data was recorded on a voluntary basis, and cloning the voices of those speakers is forbidden).

The .tgz file contains the following directories:

texts/ - Contains text transcripts in .txt format
audios/ - Contains audio files in .wav format
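
The layout above suggests a simple way to load (speech, text) pairs. The following is a minimal sketch, not an official loader. It assumes (this page does not state it) that each .wav file in audios/ has a transcript in texts/ with the same base name, that both directories sit directly under the extraction root, and that the audio is PCM WAV readable by Python's standard wave module. The function names pair_files and wav_seconds are hypothetical.

```python
import wave
from pathlib import Path


def pair_files(root):
    """Return sorted (audio_path, text_path) pairs matched by base name.

    Assumption: audios/<name>.wav corresponds to texts/<name>.txt.
    Audio files without a matching transcript are skipped.
    """
    root = Path(root)
    texts = {p.stem: p for p in (root / "texts").glob("*.txt")}
    pairs = []
    for wav_path in sorted((root / "audios").glob("*.wav")):
        txt_path = texts.get(wav_path.stem)
        if txt_path is not None:
            pairs.append((wav_path, txt_path))
    return pairs


def wav_seconds(path):
    """Duration of a PCM WAV file in seconds (frames / sample rate)."""
    with wave.open(str(path), "rb") as w:
        return w.getnframes() / w.getframerate()
```

After extracting the archive, calling pair_files on the extraction directory and reading each transcript with txt_path.read_text(encoding="utf-8") (UTF-8 is also an assumption) gives the paired data. wav_seconds can be used to check that chunks fall in the stated 3-15 second range.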

About the original source (Grqaser):

    "Grqaser" is a non-governmental organization dedicated to promoting Armenian language preservation globally through
    the creation of a comprehensive library of Armenian audiobooks. Established in 2015, "Grqaser" aims to facilitate
    access to Armenian literature for diaspora communities and individuals with visual impairments. Their initiative
    provides a valuable resource for listeners to engage with Armenian culture and language through accessible audio
    formats, supporting educational and cultural enrichment worldwide.

Author(s) of the corpus

Ara Yeroyan, ar23yeroyan gmail.com

External URL: https://grqaser.org Source for original audiobooks (raw data)