New releases from the LDC


English Gigaword Second Edition

HKUST Mandarin Telephone Speech, Part 1

HKUST Mandarin Telephone Transcript Data, Part 1

The Linguistic Data Consortium (LDC) would like to announce the availability of three new corpora.


English Gigaword Second Edition is a comprehensive archive of English newswire text that the LDC has acquired over several years. This release includes all of the content of the first release of the English Gigaword corpus (LDC2003T05) as well as new data from July 2002 through December 2004. The documents from the first release have received one minor update: the text portions of "story" type documents have been line-wrapped so that no line exceeds 80 characters. Documents of the other types have not been modified. The corpus contains five distinct international sources of English newswire:

Agence France Presse English Service (afe)
Associated Press Worldstream English Service (apw)
Central News Agency of Taiwan English Service (cne)
The New York Times Newswire Service (nyt)
The Xinhua News Agency English Service (xie)


The Hong Kong University of Science and Technology (HKUST) collected and transcribed 200 hours of Mandarin Chinese conversational telephone speech from Mandarin speakers in mainland China. HKUST Mandarin Telephone Speech, Part 1 contains the training and development sets with 873 and 24 calls, respectively.

All calls were operator-assisted: an operator called the two participants at the scheduled time to initiate the call. Subjects were asked demographic questions before they were bridged for normal conversation, and their answers were recorded in separate files. Subjects were allowed to talk for up to 10 minutes, and with a few exceptions the calls run to the maximum length. Each side of a call was recorded in a separate wav file, sampled at 8 kHz with 8 bits per sample (A-law encoded).
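For readers unfamiliar with the audio format, the following is a minimal sketch of how 8-bit A-law samples expand to 16-bit linear PCM, following the widely used G.711 reference expansion. It is not LDC tooling: the function names `alaw_to_linear` and `decode_alaw` are hypothetical, and the sketch assumes you have already extracted the raw sample bytes from a file (it does not parse any file header).

```python
def alaw_to_linear(a_val):
    """Expand one 8-bit A-law code (0..255) to a signed 16-bit linear sample."""
    a_val ^= 0x55                    # undo the even-bit inversion used by A-law
    t = (a_val & 0x0F) << 4          # 4-bit mantissa (quantization bits)
    seg = (a_val & 0x70) >> 4        # 3-bit segment (exponent)
    if seg == 0:
        t += 8
    elif seg == 1:
        t += 0x108
    else:
        t += 0x108
        t <<= seg - 1
    # After the XOR, a set sign bit marks a positive sample.
    return t if (a_val & 0x80) else -t


def decode_alaw(data):
    """Decode a bytes object of A-law samples into a list of linear PCM ints."""
    return [alaw_to_linear(b) for b in data]


if __name__ == "__main__":
    # Illustrative bytes only, not taken from the corpus.
    print(decode_alaw(bytes([0xD5, 0x55, 0xAA, 0x2A])))
```

At 8 kHz and one byte per sample, each second of one call side occupies 8,000 bytes of sample data, so a full 10-minute side is roughly 4.8 MB before any header.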


HKUST Mandarin Telephone Transcript Data, Part 1 is the corresponding transcription for HKUST Mandarin Telephone Speech, Part 1. The transcripts use standard simplified Chinese characters encoded in GBK (CP-936). The transcribed speech was segmented at natural boundaries wherever possible, and each segment is no more than 10 seconds long. The Chinese text is not segmented into words, though occasional white space appears within some turns. HKUST Mandarin Telephone Transcript Data, Part 1 is distributed via web download.
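Because the transcripts are GBK-encoded, tools that expect UTF-8 will need the text converted first. The following is a minimal sketch of one way to do that in Python; the function name `gbk_file_to_utf8` and the file names in the demo are hypothetical and do not reflect the actual layout of the release.

```python
from pathlib import Path
import tempfile


def gbk_file_to_utf8(src, dst):
    """Re-encode a GBK (CP-936) text file as UTF-8.

    Decoding raises UnicodeDecodeError if the input is not valid GBK,
    which is a useful early warning that a file uses another encoding.
    """
    text = Path(src).read_bytes().decode("gbk")
    Path(dst).write_bytes(text.encode("utf-8"))
    return text


if __name__ == "__main__":
    # Demo with a made-up sample line, not taken from the corpus.
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "sample_gbk.txt"    # hypothetical file name
        dst = Path(tmp) / "sample_utf8.txt"   # hypothetical file name
        src.write_bytes("你好 世界\n".encode("gbk"))
        print(gbk_file_to_utf8(src, dst), end="")
```

Python also accepts `cp936` as an alias for the `gbk` codec, so either name can be used when decoding.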