SELL-CORPUS is a multi-accent speech corpus for L2 English learning in China. It is intended to support research on multi-accent acoustic modeling, mispronunciation detection, and pronunciation assessment for future nationwide oral English tests. The corpus contains 31.6 hours of speech recordings contributed by 389 volunteer speakers, including 186 males and 203 females. It covers all seven major regional dialects in China: Mandarin (north and southwest regions), Wu language, Cantonese, Gan dialect, Minnan dialect, Xiang dialect, and Hakka, and it provides a baseline for a Chinese multi-accent automatic speech recognition system. Because Mandarin accents vary widely between the north and southwest of China, we further divide Mandarin into north Mandarin and southwest Mandarin according to accent resemblance. The population distribution for these dialectal regions is shown in the figure below.
Statistics on speakers' gender, utterances, and recording hours in our corpus are shown in the table below.
| # of speakers | 185 | 31 | 108 | 13 | 24 | 10 | 18 |
| # of male | 98 | 9 | 39 | 6 | 19 | 10 | 9 |
| # of female | 87 | 22 | 69 | 7 | 5 | 0 | 9 |
| # of utterances | 5830 | 689 | 3714 | 398 | 613 | 300 | 643 |
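The rows of the table above can be cross-checked with a few lines of Python. This is only a minimal sketch: the numbers are copied from the table, the columns are addressed by position because their labels are not given here, and the helper name `column_mismatches` is made up for illustration.

```python
# Per-column counts copied from the table above, in the table's column order.
speakers = [185, 31, 108, 13, 24, 10, 18]
males = [98, 9, 39, 6, 19, 10, 9]
females = [87, 22, 69, 7, 5, 0, 9]
utterances = [5830, 689, 3714, 398, 613, 300, 643]


def column_mismatches(speakers, males, females):
    """Return the indices of columns where males + females != speakers."""
    return [
        i
        for i, (s, m, f) in enumerate(zip(speakers, males, females))
        if m + f != s
    ]


total_speakers = sum(speakers)
total_utterances = sum(utterances)

# Average number of utterances per speaker in each column.
utterances_per_speaker = [round(u / s, 1) for u, s in zip(utterances, speakers)]

print("inconsistent columns:", column_mismatches(speakers, males, females))
print("total speakers:", total_speakers)
print("total utterances:", total_utterances)
print("utterances per speaker:", utterances_per_speaker)
```

The per-column check confirms that the male and female counts add up to the speaker count in every column, and the speaker total matches the 389 volunteers stated in the description.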
We built our baseline ASR system with the Kaldi toolkit and report the word error rate (WER) of our model on the SELL-CORPUS dev set and test set.
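For readers unfamiliar with the metric, WER is the word-level edit distance (substitutions, deletions, and insertions) between the recognized text and the reference transcript, divided by the number of reference words. The sketch below illustrates that standard definition for a single utterance pair. It is not Kaldi's scoring script, and the function name `word_error_rate` is chosen here for illustration only.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            )
        prev = curr
    return prev[-1] / len(ref)


print(word_error_rate("the cat sat", "the hat sat"))
```

Over a whole dev or test set, WER is normally computed from the total number of errors divided by the total number of reference words, not by averaging per-utterance values.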
The corpus is released under the CC BY-NC 4.0 license. Please read the license before using the corpus. You can read the full license here. Download LICENSE
You can download the entire SELL-CORPUS archive here. Note that the entire archive is 2.6 GB, so it may take some time to download. Download SELL-CORPUS(full, 2.6G) SELL-CORPUS md5
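After a download of this size, it is worth comparing the archive against the published md5 value. The following is a minimal sketch using Python's standard `hashlib` module. The file names in the usage comment are hypothetical, and the code assumes the md5 file holds the hex digest as its first whitespace-separated field, as in `md5sum` output.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # Read 1 MiB at a time so a multi-gigabyte archive is never
        # loaded into memory all at once.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_md5(path, expected):
    """Compare a file's MD5 with an expected value.

    `expected` may be a bare hex digest or a full md5sum-style line
    ("<digest>  <filename>"); only the first field is used.
    """
    expected_hex = expected.split()[0].lower()
    return md5_of_file(path) == expected_hex


# Hypothetical usage; substitute the names of the files you downloaded:
# with open("SELL-CORPUS.md5") as f:
#     print(matches_md5("SELL-CORPUS.tar.gz", f.read()))
```

The same check applies to the samples and annotation archives, which are published with their own md5 files.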
If you want a general idea of the structure and content of the corpus, we have selected the utterances of two volunteers per dialect from the whole corpus to form a samples archive. Its size is 134 MB. You can download the samples here: Download SELL-CORPUS(samples) SELL-CORPUS(samples) md5
We have manually annotated 1600 utterances by selecting 8 datasets from the seven major dialectal regions, each containing about 200 utterances. You can download the annotations here; more detailed information is in the readme file of the archive. Download SELL-CORPUS(Annotation) SELL-CORPUS(Annotation) md5