The Lancaster Los Angeles Spoken Chinese Corpus (LLSCC)
The Lancaster Los Angeles Spoken Chinese Corpus (LLSCC) is a corpus of spoken Mandarin Chinese. It comprises 924,242 words of dialogues and monologues, both spontaneous and scripted, in 70,569 sentences and 48,938 utterance units (paragraphs). LLSCC has six subcorpora, described below.
Conversations: 6 transcripts of face-to-face conversation, totalling 60,806 words;
Telephone Calls: 120 transcripts of telephone conversations between overseas Chinese and their families in China, totalling 295,026 words;
Play & Movie Transcripts: 12 transcripts of actual performances of TV plays, operas and movies, totalling 80,446 words;
TV Talk Show Transcripts: 20 transcripts of the CCTV talk show Shi Hua Shi Shuo (Tell It Like It Is), totalling 118,588 words;
Oral Narratives: 49 narratives of native Beijing residents, totalling 102,262 words;
Edited Oral Narratives: 100 Chinese profiles (Beijing Ren edited by Zhang Xinxin & Sang Ye), totalling 267,114 words.
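As a quick consistency check, the six subcorpus word counts listed above can be summed and compared with the stated corpus size. The short sketch below does this and also derives each subcorpus's share of the whole; the variable names are illustrative only and are not part of the corpus distribution.

```python
# Word counts per subcorpus, copied from the list above.
SUBCORPUS_WORDS = {
    "Conversations": 60_806,
    "Telephone Calls": 295_026,
    "Play & Movie Transcripts": 80_446,
    "TV Talk Show Transcripts": 118_588,
    "Oral Narratives": 102_262,
    "Edited Oral Narratives": 267_114,
}

# The sum should match the overall size given in the description.
total_words = sum(SUBCORPUS_WORDS.values())
print(f"Total: {total_words:,} words")

# Share of each subcorpus, as a percentage of the total.
shares = {name: 100 * n / total_words for name, n in SUBCORPUS_WORDS.items()}
for name, pct in shares.items():
    print(f"{name}: {pct:.1f}%")
```

The total comes to 924,242 words, which agrees with the figure given for the corpus as a whole.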
The corpus is XML-compliant. Each corpus file consists of a corpus header and a text body. The header gives general information about the file. In the body, utterance units (or paragraphs), sentences and tokens are marked up, and each token is also annotated for part of speech.
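To illustrate the kind of structure described above, here is a minimal sketch of reading such a file with Python's standard xml.etree.ElementTree module. The description does not give the actual element or attribute names, so everything in the sample is an assumption made for illustration: the elements text, header, body, u (utterance unit), s (sentence) and w (token), the pos attribute, and the placeholder tag values. The real LLSCC markup may differ.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample file: element names, attributes and POS tags are
# assumptions for illustration, not the actual LLSCC scheme.
SAMPLE = """<text id="demo">
  <header><title>Hypothetical sample</title></header>
  <body>
    <u n="1">
      <s n="1"><w pos="r">我</w><w pos="v">知道</w></s>
    </u>
    <u n="2">
      <s n="1"><w pos="r">你</w><w pos="d">也</w><w pos="v">去</w><w pos="y">吗</w></s>
    </u>
  </body>
</text>"""


def summarize(xml_text):
    """Count utterance units and sentences, and collect (token, POS) pairs."""
    root = ET.fromstring(xml_text)
    body = root.find("body")
    utterances = body.findall("u")      # direct children of the body
    sentences = body.findall(".//s")    # sentences at any depth
    tokens = [(w.text, w.get("pos")) for w in body.iter("w")]
    return len(utterances), len(sentences), tokens


n_utt, n_sent, tokens = summarize(SAMPLE)
print(n_utt, "utterance units,", n_sent, "sentences,", len(tokens), "tokens")
```

With markup of this shape, counts such as the sentence and utterance-unit totals quoted for the corpus could be obtained by running the same function over every file and adding up the results.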
The corpus is a joint project undertaken by Dr. Richard Xiao (UCREL, Lancaster University) and Professor Hongyin Tao (University of California, Los Angeles). Regrettably, the corpus cannot be released to the public for the time being because of copyright restrictions.