Audiocite.net
Identifier: SLR139
Summary: Spoken dataset of books read in French, initially collected from audiocite.net by the GETALP team for the LeBenchmark project.
Category: Speech
License: Depends on each audio file (CC BY (Public domain) or CC BY SA (Share alike) or CC BY ND (No modification) or CC BY NC (No commercial use) or combination of the above)
Downloads (use a mirror closer to you):
audiocite.net_md5.txt (Checksums) Mirrors: [US] [EU] [CN]
audiocite.net_0.zip (Metadata and Scripts Files) Mirrors: [US] [EU] [CN]
audiocite.net_1.zip (mp3 files) Mirrors: [US] [EU] [CN]
audiocite.net_2.zip (mp3 files) Mirrors: [US] [EU] [CN]
audiocite.net_3.zip (mp3 files) Mirrors: [US] [EU] [CN]
audiocite.net_4.zip (mp3 files) Mirrors: [US] [EU] [CN]
audiocite.net_5.zip (mp3 files) Mirrors: [US] [EU] [CN]
audiocite.net_6.zip (mp3 files) Mirrors: [US] [EU] [CN]
audiocite.net_7.zip (mp3 files) Mirrors: [US] [EU] [CN]
audiocite.net_8.zip (mp3 files) Mirrors: [US] [EU] [CN]
audiocite.net_9.zip (mp3 files) Mirrors: [US] [EU] [CN]
audiocite.net_10.zip (mp3 files) Mirrors: [US] [EU] [CN]
audiocite.net_11.zip (mp3 files) Mirrors: [US] [EU] [CN]
audiocite.net_12.zip (mp3 files) Mirrors: [US] [EU] [CN]
audiocite.net_13.zip (mp3 files) Mirrors: [US] [EU] [CN]
audiocite.net_14.zip (mp3 files) Mirrors: [US] [EU] [CN]
audiocite.net_15.zip (mp3 files) Mirrors: [US] [EU] [CN]
audiocite.net_16.zip (mp3 files) Mirrors: [US] [EU] [CN]
audiocite.net_17.zip (mp3 files) Mirrors: [US] [EU] [CN]
audiocite.net_18.zip (mp3 files) Mirrors: [US] [EU] [CN]
audiocite.net_19.zip (mp3 files) Mirrors: [US] [EU] [CN]
audiocite.net_20.zip (mp3 files) Mirrors: [US] [EU] [CN]
audiocite.net_21.zip (mp3 files) Mirrors: [US] [EU] [CN]
audiocite.net_22.zip (mp3 files) Mirrors: [US] [EU] [CN]
audiocite.net_23.zip (mp3 files) Mirrors: [US] [EU] [CN]
audiocite.net_24.zip (mp3 files) Mirrors: [US] [EU] [CN]
audiocite.net_25.zip (mp3 files) Mirrors: [US] [EU] [CN]
audiocite.net_26.zip (mp3 files) Mirrors: [US] [EU] [CN]
audiocite.net_27.zip (mp3 files) Mirrors: [US] [EU] [CN]
audiocite.net_28.zip (mp3 files) Mirrors: [US] [EU] [CN]
audiocite.net_29.zip (mp3 files) Mirrors: [US] [EU] [CN]
audiocite.net_30.zip (mp3 files) Mirrors: [US] [EU] [CN]
audiocite.net_31.zip (mp3 files) Mirrors: [US] [EU] [CN]
audiocite.net_32.zip (mp3 files) Mirrors: [US] [EU] [CN]
audiocite.net_33.zip (mp3 files) Mirrors: [US] [EU] [CN]
audiocite.net_34.zip (mp3 files) Mirrors: [US] [EU] [CN]
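Since the archive list includes a checksum file, a download can be checked against it before unpacking. The following is a minimal sketch, assuming that audiocite.net_md5.txt uses the common `md5sum` layout (one `<md5 hex>  <file name>` pair per line); that layout is an assumption, not something stated on this page, and the function names are illustrative.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def parse_checksums(text):
    """Parse lines of the ASSUMED form '<md5 hex>  <file name>' into a dict."""
    expected = {}
    for line in text.splitlines():
        parts = line.split(maxsplit=1)
        if len(parts) == 2:
            digest, name = parts
            # md5sum marks binary mode with a leading '*' on the file name.
            expected[name.strip().lstrip("*")] = digest.lower()
    return expected


def verify_downloads(checksum_file, download_dir):
    """Map each listed file name to True/False, or None if the file is absent."""
    expected = parse_checksums(Path(checksum_file).read_text())
    results = {}
    for name, digest in expected.items():
        candidate = Path(download_dir) / name
        if candidate.exists():
            results[name] = (md5_of_file(candidate) == digest)
        else:
            results[name] = None
    return results
```

Calling `verify_downloads("audiocite.net_md5.txt", ".")` in the download directory would then report which archives match, which differ, and which have not been downloaded yet.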
About this resource:
With a total duration of 6682 hours of audio recordings, this corpus is the result of the voluntary work of 130 speakers. The metadata is divided into four .json files (all (100%), train (80%), dev (10%) and test (10%)) for use in NLP models.
The corpus and its metadata were uploaded through a script distributing the information in a .csv file. These audio and metadata files are intended for use with pre-trained speech models.
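The split proportions described above can be checked once the metadata archive is unpacked. This is a rough sketch under several assumptions that this page does not confirm: that the four files are named `all.json`, `train.json`, `dev.json` and `test.json`, that each file holds one top-level list or object with one item per recording, and that the split was made by number of entries rather than by duration. The README.md in audiocite.net_0.zip is the place to confirm the actual names and schema.

```python
import json
from pathlib import Path


def count_entries(json_path):
    """Count top-level items in a JSON file (works for a list or an object)."""
    with open(json_path, encoding="utf-8") as handle:
        data = json.load(handle)
    return len(data)


def split_fractions(metadata_dir, names=("train", "dev", "test"), full="all"):
    """Return each split's share of the full set; file names are HYPOTHETICAL."""
    directory = Path(metadata_dir)
    total = count_entries(directory / f"{full}.json")
    return {
        name: count_entries(directory / f"{name}.json") / total
        for name in names
    }
```

If the assumptions hold, `split_fractions("path/to/metadata")` should return values close to 0.8, 0.1 and 0.1.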
Speaker Gender* | Number of Files | Number of Speakers | Total Duration | Avg. Duration | Min. Duration | Max. Duration |
---|---|---|---|---|---|---|
M | 19345 | 51 | 4127:34:45 | 00:12:48 | 00:15:01 | 06:39:43 |
F | 8261 | 70 | 2272:18:23 | 00:16:30 | 00:03:49 | 02:46:06 |
U | 879 | 9 | 282:07:09 | 00:19:15 | 00:00:09 | 03:30:16 |
Total | 28485 | 130 | 6682:00:18 | 00:14:04 | 00:00:09 | 06:39:43 |
*Beware: speaker gender has been guessed and should not be considered ground truth (cf. README.md in audiocite.net_0.zip).
You can cite the data using the following BibTeX entry:
@inproceedings{Felice2024,
  title={Audiocite.net: A Large Spoken Read Dataset in French},
  author={Soline Felice and Solène Evain and Solange Rossato and François Portet},
  booktitle={The 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)},
  year={2024}
}
Please also cite the Audiocite.net website (https://www.audiocite.net/) if you use this dataset for your own research.
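The averages in the statistics table above can be reproduced from the file counts and total durations. The sketch below assumes the durations are written as hours:minutes:seconds and that averages are truncated to whole seconds; the helper names are illustrative. Under those assumptions the per-gender averages match the table, and the summed total comes out one second short of the reported 6682:00:18, which would be consistent with sub-second rounding in the original figures.

```python
def to_seconds(duration):
    """Convert an 'H:MM:SS' string (hours may exceed 99) to whole seconds."""
    hours, minutes, seconds = (int(part) for part in duration.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def to_hms(total_seconds):
    """Format whole seconds as 'HH:MM:SS' (hours are not capped at 24)."""
    hours, remainder = divmod(int(total_seconds), 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"


# (number of files, total duration) copied from the table, per guessed gender.
rows = {
    "M": (19345, "4127:34:45"),
    "F": (8261, "2272:18:23"),
    "U": (879, "282:07:09"),
}

# Average duration per file, truncated to whole seconds.
averages = {
    gender: to_hms(to_seconds(total) // n_files)
    for gender, (n_files, total) in rows.items()
}

total_files = sum(n_files for n_files, _ in rows.values())
total_seconds = sum(to_seconds(total) for _, total in rows.values())

print(averages)
print(total_files, to_hms(total_seconds), to_hms(total_seconds // total_files))
```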