Audio samples from "DiDiSpeech: A Large Scale Mandarin Speech Corpus"

Paper: arXiv
Authors: Tingwei Guo, Cheng Wen, DongWei Jiang, Ne Luo, RuiXiong Zhang, ShuaiJiang Zhao, WuBo Li, Cheng Gong, Wei Zou, Kun Han, XianGang Li
Abstract: This paper introduces a new open-sourced Mandarin speech corpus, called DiDiSpeech. It consists of about 800 hours of speech data at a 48 kHz sampling rate from 6000 speakers, together with the corresponding texts. All speech data in the corpus was recorded in quiet environments and is suitable for various speech processing tasks, such as voice conversion, multi-speaker text-to-speech and automatic speech recognition. We conduct experiments on multiple speech tasks and evaluate the performance, showing that the corpus is promising for both academic research and practical applications. The corpus is available at https://outreach.didichuxing.com/research/opendata/.

For more information, refer to the paper "DiDiSpeech: A Large Scale Mandarin Speech Corpus", Tingwei Guo, Cheng Wen, Dongwei Jiang, Ne Luo, Ruixiong Zhang, Shuaijiang Zhao, Wubo Li, Cheng Gong, Wei Zou, Kun Han, Xiangang Li, arXiv:2010.09275, 2020. If you use the DiDiSpeech corpus in your work, please cite this paper where it was introduced.
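Since the corpus is distributed as 48 kHz recordings, a first step after downloading is usually to confirm the sampling rate and duration of each file. A minimal sketch using only the Python standard library (the file path is a placeholder, not an actual DiDiSpeech filename):

```python
# Sketch: inspecting a corpus WAV file's sampling rate, channel count and
# duration with the standard-library wave module. Assumes uncompressed
# PCM WAV input; the path below is hypothetical.
import wave

def audio_info(path):
    """Return (sample_rate_hz, num_channels, duration_seconds) of a WAV file."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
        duration = wav.getnframes() / rate
    return rate, channels, duration

# Example (hypothetical path):
# rate, channels, duration = audio_info("didispeech/speaker0001/utt0001.wav")
# assert rate == 48000  # corpus audio is stated to be 48 kHz
```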

Multi-speaker TTS

This section presents synthesized audio samples from our multi-speaker speech synthesis models trained on the DiDiSpeech corpus. Each column corresponds to a single speaker. The first row contains the reference audio for each speaker, while the rows below contain audio samples synthesized by our models.

1. Seen Speakers

Speaker 1 | Speaker 2 | Speaker 3 | Speaker 4 | Speaker 5
Reference
Synthesized audio 1
Synthesized audio 2

2. Unseen Speakers

Speaker 1 | Speaker 2 | Speaker 3 | Speaker 4 | Speaker 5
Reference
Synthesized audio 1
Synthesized audio 2
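A common way to judge how well synthesized audio matches the reference speaker (especially for unseen speakers) is the cosine similarity between speaker embeddings of the two utterances. The embedding extractor itself (e.g. a speaker-verification model) is not part of this page; the sketch below only shows the similarity computation on toy vectors:

```python
# Sketch: cosine similarity between two speaker-embedding vectors.
# In practice the vectors would come from a speaker-verification model
# (hypothetical here); a score near 1.0 indicates the same speaker identity.
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```

For example, an embedding compared with itself scores exactly 1.0, and orthogonal embeddings score 0.0.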

Voice conversion

Audio samples from both the parallel and non-parallel voice conversion (VC) models trained on the DiDiSpeech corpus are provided here. In the rest of this section, the source and target audio, which were held out from the training data, are speech samples recorded from the source and target speakers respectively. The converted audio in each row is produced by our VC models from the source audio in the same row.
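The distinction between the two settings is that parallel VC is trained on pairs of source- and target-speaker utterances with the same spoken content, while non-parallel VC needs no such pairing. A minimal sketch of building those pairs by matching transcripts (the `(transcript, wav_path)` record format is hypothetical, not the corpus's actual layout):

```python
# Sketch: pairing source- and target-speaker utterances that share a
# transcript, as needed for parallel VC training. Utterances are assumed
# to be (transcript, wav_path) tuples; this is an illustrative format only.
def pair_parallel_utterances(source_utts, target_utts):
    """Return (source_wav, target_wav) pairs whose transcripts match."""
    target_by_text = {text: wav for text, wav in target_utts}
    return [
        (src_wav, target_by_text[text])
        for text, src_wav in source_utts
        if text in target_by_text
    ]
```

Utterances without a matching transcript on the other side are simply dropped, which is why parallel VC data is more expensive to collect than non-parallel data.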

1. Parallel VC

Source audio | Target audio | Converted audio
Intra-gender sample (Female)
Intra-gender sample (Male)
Inter-gender sample (Male to Female)
Inter-gender sample (Female to Male)

2. Non-parallel VC

Source audio | Target audio | Converted audio
Intra-gender sample (Female)
Intra-gender sample (Male)
Inter-gender sample (Male to Female)
Inter-gender sample (Female to Male)