Chinese Speech Corpus: THCHS30

Chinese speech corpus (13388 utterances)
data_thchs30.tgz [6.4G] (speech data and transcripts)
Mirrors: [China]
About this resource:
A Free Chinese Speech Corpus Released by CSLT@Tsinghua University.
THCHS30 is an open Chinese speech database published by the Center
for Speech and Language Technology (CSLT) at Tsinghua University.
The original recording was conducted in 2002 by Dong Wang,
supervised by Prof. Xiaoyan Zhu, at the Key State Lab of
Intelligence and System, Department of Computer Science, Tsinghua
University, and the original name was 'TCMSD', standing for
'Tsinghua Continuous Mandarin Speech Database'. Its publication
13 years later was initiated by Dr. Dong Wang and supported by
Prof. Xiaoyan Zhu. We hope to provide a toy database for new
researchers in the field of speech recognition. Therefore, the
database is completely free for academic users. You can cite the
data using the following BibTeX entry:
@misc{THCHS30_2015,
}
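Reference: thanks to the release by Dong Wang of the CSLT lab at Tsinghua University,
Kaldi finally has a free Chinese speech recognition example (GitHub link). You can use it to train your own models.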
First released: 2000 - 2001 [13] [Dong Wang, Dalei Wu, and Xiaoyan
Zhu, "TCMSD: a new Chinese continuous speech database," in
International Conference on Chinese Computing (ICCC'01), 2001.]
Purpose: to complement the 863 database; the biphone and triphone coverage of the two is compared in Table 1.
Name: THCHS-30, standing for 'Tsinghua Chinese 30 hour database'.
Recording environment: a single carbon microphone in a quiet office.
Speakers: young university students who speak Mandarin fluently.
Audio format: the sampling rate is 16000 Hz and the sample size is 16 bits.
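As a quick sanity check on downloaded audio, the sketch below reads a WAV header with Python's standard `wave` module and compares it against the stated 16000 Hz, 16-bit format. The helper names (`describe_wav`, `matches_corpus_format`, `make_silent_wav`) are made up for this illustration and are not part of the corpus or of Kaldi; the demo uses a synthetic in-memory file because no corpus file paths are given here.

```python
import io
import wave

EXPECTED_RATE_HZ = 16000    # sampling rate stated in the corpus description
EXPECTED_SAMPLE_BYTES = 2   # 16 bits per sample


def describe_wav(source):
    """Return (sample_rate, sample_width_bytes, duration_seconds).

    `source` may be a file path or a binary file-like object.
    """
    with wave.open(source, "rb") as wav:
        rate = wav.getframerate()
        width = wav.getsampwidth()
        duration = wav.getnframes() / rate
    return rate, width, duration


def matches_corpus_format(source):
    """True if the WAV header shows 16000 Hz and 16-bit samples."""
    rate, width, _ = describe_wav(source)
    return rate == EXPECTED_RATE_HZ and width == EXPECTED_SAMPLE_BYTES


def make_silent_wav(seconds=1.0, rate=EXPECTED_RATE_HZ, width=EXPECTED_SAMPLE_BYTES):
    """Build an in-memory mono WAV of silence (only used to demo the check)."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)  # mono is an assumption made for the demo file
        wav.setsampwidth(width)
        wav.setframerate(rate)
        wav.writeframes(b"\x00" * (int(seconds * rate) * width))
    buf.seek(0)
    return buf


if __name__ == "__main__":
    print(describe_wav(make_silent_wav(2.0)))
    print(matches_corpus_format(make_silent_wav(0.5, rate=8000)))
```

To check a real recording, pass its path to `matches_corpus_format` instead of the synthetic buffer.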
Script source: news, 1000 sentences
Partition:
A (sentence ID from 1 to 250),
B (sentence ID from 251 to 500),
C (sentence ID from 501 to 750),
D (sentence ID from 751 to 1000).
Training: A, B and C, which involve 30 speakers and 10893 utterances.
Test: D, which involves 10 speakers and 2496 utterances.
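The partition above can be sketched as a small lookup from sentence ID to group and split. This is only an illustration: the function names are hypothetical, and it takes a plain integer ID because how sentence IDs are encoded in the corpus file names is not described here.

```python
# Sentence-ID ranges for the four groups, as listed in the partition above.
GROUP_RANGES = {
    "A": range(1, 251),
    "B": range(251, 501),
    "C": range(501, 751),
    "D": range(751, 1001),
}

# Groups A, B and C form the training set; group D is the test set.
TRAIN_GROUPS = {"A", "B", "C"}


def sentence_group(sentence_id):
    """Return the group letter (A-D) for a sentence ID between 1 and 1000."""
    for group, ids in GROUP_RANGES.items():
        if sentence_id in ids:
            return group
    raise ValueError(f"sentence ID out of range 1..1000: {sentence_id}")


def split_for(sentence_id):
    """Return 'train' or 'test' for a sentence ID."""
    return "train" if sentence_group(sentence_id) in TRAIN_GROUPS else "test"


if __name__ == "__main__":
    for sid in (1, 250, 251, 750, 751, 1000):
        print(sid, sentence_group(sid), split_for(sid))
```

Because the split is made by sentence group, every test sentence (group D) has an ID that never appears in the training data.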