Chinese Gigaword Second Edition

Haiyang Ai

Administrator
Staff member
#1
Chinese Gigaword Release Second Edition is a comprehensive archive
of newswire text data in Chinese that has been acquired over several
years by the LDC. This release includes all of the contents in the first
release of the Chinese Gigaword corpus (LDC2003T09), material from
one new source, as well as new materials from the other two sources.
Thus, the corpus contains three distinct international sources of Chinese
newswire: Central News Agency (Taiwan), Xinhua News Agency, and
Zaobao. Some minor updates to the documents from the first release
have been made.

http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005T14
 

Haiyang Ai

Administrator
Staff member
#3
Chinese Gigaword is a commercial corpus.
"Gigaword" here just indicates the size of the corpus, I believe, though I am not quite sure about the original idea behind the name.
 

xujiajin

Administrator
Staff member
#4
Re: Chinese Gigaword Second Edition

An update

Tagged Chinese Gigaword
http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=LDC2007T03

Chinese Gigaword Third Edition
http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=LDC2007T38

The Gigaword family includes several corpora of news media language:
Catalog Number   Corpus Name
LDC2003T12       Arabic Gigaword
LDC2006T02       Arabic Gigaword Second Edition
LDC2007T40       Arabic Gigaword Third Edition
LDC2003T09       Chinese Gigaword
LDC2005T14       Chinese Gigaword Second Edition
LDC2007T38       Chinese Gigaword Third Edition
LDC2003T05       English Gigaword
LDC2005T12       English Gigaword Second Edition
LDC2007T07       English Gigaword Third Edition
LDC2006T17       French Gigaword First Edition
LDC2006T12       Spanish Gigaword First Edition
LDC2007T03       Tagged Chinese Gigaword

English Gigaword Third Edition
http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2007T07
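
The catalog links in this thread all follow the same pattern, so the pages for the other corpora in the list can probably be found the same way. Here is a minimal Python sketch, assuming that URL pattern still holds on the LDC site (it may have changed). The names catalog_url, CATALOG_URL and CATALOG_ID are made up for illustration and are not part of any LDC tool.

```python
import re

# URL pattern copied from the links in this thread. Whether the LDC site
# still serves catalog pages at this address is an assumption.
CATALOG_URL = "http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId={}"

# The catalog numbers listed above all look like: LDC + 4 digits + T + 2 digits.
CATALOG_ID = re.compile(r"LDC\d{4}T\d{2}")


def catalog_url(catalog_id):
    """Build the catalog page URL for a catalog number such as 'LDC2005T14'."""
    if not CATALOG_ID.fullmatch(catalog_id):
        raise ValueError("unexpected catalog number: " + catalog_id)
    return CATALOG_URL.format(catalog_id)


# The three Chinese Gigaword editions from the list above.
chinese_gigaword = {
    "LDC2003T09": "Chinese Gigaword",
    "LDC2005T14": "Chinese Gigaword Second Edition",
    "LDC2007T38": "Chinese Gigaword Third Edition",
}

if __name__ == "__main__":
    for catalog_id, name in chinese_gigaword.items():
        print(name, catalog_url(catalog_id))
```

The check on the catalog number is only there to catch typos; it is based on the entries listed in this post, not on any official LDC specification.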
 
#5
Could you please let me know how I can use this corpus? I've tried using WordSmith, but the Chinese characters do not show up on the screen. Sorry if my question sounds idiotic. I'm not a corpus linguist and am just trying to grapple with the tools for my research.

In addition, I can't seem to find any information on the time span this corpus covers. Would you perhaps be able to help?

Thanks a lot in advance!
 