SELL-CORPUS


SELL-CORPUS is a multi-accent speech corpus for L2 English learning in China. It is intended to support research on multi-accent acoustic modeling, mispronunciation detection, and pronunciation assessment for future nationwide oral English tests. The corpus contains 31.6 hours of speech recordings contributed by 389 volunteer speakers (186 male and 203 female). It covers all seven major regional dialects and provides a baseline for a multi-accent automatic speech recognition system for Chinese speakers of English.

The corpus covers all seven major regional dialects in China: Mandarin (north and southwest regions), Wu, Cantonese, Gan, Minnan, Xiang, and Hakka. Because Mandarin accents vary widely between the north and southwest of China, we further divide Mandarin into North Mandarin and Southwest Mandarin according to accent similarity. The distribution of these dialect regions is shown in the figure below.


Geographical distribution of major Chinese dialects

Structure

Recording hours by data set and speaker gender are shown in the table below.

data set      duration (hours)   male (hours)   female (hours)
training      27.2               14.0           13.2
development   2.3                1.4            0.9
test          2.1                1.3            0.8


Statistics

The number of speakers by gender, the number of utterances, and the recording hours for each dialect are shown in the table below.

Dialect            Mandarin (North/Southwest)   Cantonese (Yue)   Wu     Xiang   Minnan   Hakka (Kejia)   Gan
# of speakers      185                          31                108    13      24       10              18
# of male          98                           9                 39     6       19       10              9
# of female        87                           22                69     7       5        0               9
# of utterances    5830                         689               3714   398     613      300             643
duration (hours)   14.8                         1.7               9.6    1.0     1.7      0.9             1.9
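
The per-dialect rows can be summed and compared with the corpus-wide figures quoted in the introduction (389 speakers, 31.6 hours). The following is a minimal Python sketch of that check; the numbers are copied from the table above, and the variable and function names are illustrative, not part of the corpus distribution.

```python
# Per-dialect figures copied from the table above.
DIALECTS = ["Mandarin", "Cantonese", "Wu", "Xiang", "Minnan", "Hakka", "Gan"]
SPEAKERS = [185, 31, 108, 13, 24, 10, 18]
UTTERANCES = [5830, 689, 3714, 398, 613, 300, 643]
HOURS = [14.8, 1.7, 9.6, 1.0, 1.7, 0.9, 1.9]


def corpus_totals():
    """Return (speakers, utterances, hours) summed over all dialects."""
    return sum(SPEAKERS), sum(UTTERANCES), round(sum(HOURS), 1)


def hour_shares():
    """Return each dialect's share of the total recording time, in percent."""
    total = sum(HOURS)
    return {d: round(100 * h / total, 1) for d, h in zip(DIALECTS, HOURS)}


if __name__ == "__main__":
    print(corpus_totals())
    for dialect, share in hour_shares().items():
        print(f"{dialect:10s} {share:5.1f}%")
```

The shares make the imbalance between dialects explicit: Mandarin and Wu together account for most of the recorded hours.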


Experiment

We used the Kaldi toolkit to build our baseline ASR system. The word error rates (WER) of our models on the SELL-CORPUS development and test sets are listed below.

model                    dev set   test set
GMM-HMM (LDA+MLLT+SAT)   17.09     17.76
Chain-TDNN               10.00     11.51
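
For readers unfamiliar with the metric, WER is the word-level edit distance between the reference transcript and the recognizer output, divided by the number of reference words. The sketch below illustrates that definition for a single utterance in plain Python. It is not the scoring code used for the results above; a toolkit such as Kaldi accumulates errors over the whole evaluation set, and the function name here is hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER in percent for one utterance.

    WER = (substitutions + deletions + insertions) / number of reference words.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return 100.0 * prev[-1] / len(ref)
```

For example, comparing the reference "the cat sat on the mat" with the hypothesis "the cat sit on mat" gives one substitution and one deletion over six reference words.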


License

The corpus is released under the CC BY-NC 4.0 license. Please read the license before using the corpus. You can read the full license here.

Download LICENSE


Download

You can download the entire SELL-CORPUS archive here. Note that the full archive is 2.6 GB, so the download may take some time.

Download SELL-CORPUS(full, 2.6G)
SELL-CORPUS md5
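
After downloading, the archive can be checked against the published md5 value. The following is a minimal Python sketch using the standard `hashlib` module. It assumes the md5 file holds the hexadecimal digest as its first whitespace-separated token (the layout produced by the `md5sum` tool); the file names in the usage comment are hypothetical placeholders, so substitute the names of the files you actually downloaded.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal md5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # Reading in 1 MiB chunks keeps memory use low for a multi-GB archive.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_md5(path, expected_text):
    """Compare a file's md5 with the first token of the published md5 text."""
    expected = expected_text.strip().split()[0].lower()
    return md5_of_file(path) == expected


# Usage (hypothetical file names):
#   with open("sell-corpus.md5") as f:
#       print(matches_md5("sell-corpus.tar.gz", f.read()))
```

The same check applies to the samples and annotation archives and their md5 files.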

To give a general idea of the structure and content of the corpus, we have selected the utterances of two volunteers per dialect from the whole corpus to form a samples archive. Its size is 134 MB. You can download the samples here:

Download SELL-CORPUS(samples)
SELL-CORPUS(samples) md5

We have manually annotated 1600 utterances, selected as 8 datasets of about 200 utterances each from the seven major dialectal regions. You can download the annotations here; more detailed information is in the readme file of the archive.

Download SELL-CORPUS(Annotation)
SELL-CORPUS(Annotation) md5