Wikipedia XML Corpus (free access and download)

laohong

Administrator (staff member)
"We propose a XML corpus based on Wikipedia. This corpus can be used in a large variety of XML IR tasks like ad-hoc retrieval, categorization, clustering or Structure Mapping task. This corpus will be used for both, INEX 2006 and the XML Document Mining Challenge. You can find a description of the corpus in the technical report."

You can access the collections here:
http://www-connex.lip6.fr/~denoyer/wikipediaXML/browse.html
 
Very good news, although Wikipedia has been blocked by the Chinese government since early this year.
 
That said, the corpus appears to include a fair amount of Chinese-language material:

Each document is available in four versions: XML, HTML (generated from XML using XSLT), HTML Wikipedia Page, and WikiText.

Id | Title | Size
12 | 数学 (Mathematics) | 14012
19 | 哲学 (Philosophy) | 27307
21 | 文学 (Literature) | 22806
23 | 历史 (History) | 14067
24 | 计算机科学 (Computer science) | 30890
39 | 民族 (Ethnic group) | 45155
46 | 戏剧 (Drama) | 23498
48 | 电影 (Film) | 9707
52 | 音乐 (Music) | 16282
54 | 经济学 (Economics) | 10730
56 | 政治学 (Political science) | 22864
57 | 法学 (Law) | 14136
60 | 社会学 (Sociology) | 51180
62 | 军事学 (Military science) | 9937
68 | 物理学 (Physics) | 19989
70 | 天文学 (Astronomy) | 35138
72 | 力学 (Mechanics) | 8248
74 | 化学 (Chemistry) | 18162
76 | 地理学 (Geography) | 20876
78 | 地质学 (Geology) | 10932
79 | 气象学 (Meteorology) | 9935
82 | 生物学 (Biology) | 19479
84 | 心理学 (Psychology) | 17604
86 | 中医学 (Traditional Chinese medicine) | 39901
87 | 海洋学 (Oceanography) | 4944
89 | 水文学 (Hydrology) | 16694
94 | 测绘学 (Surveying and mapping) | 2420
100 | 农业 (Agriculture) | 7991
106 | 统一资源定位符 (Uniform Resource Locator) | 11151
107 | 首页/old (Main page/old) | 5700
111 | 数据结构 (Data structure) | 26574
112 | X\算 | 21109
113 | 设计模式 (Design pattern) | 656
118 | 中华人民共和国 (People's Republic of China) | 128062
124 | Self | 9685
125 | 克利斯登·奈加特 (Kristen Nygaard) | 21965
129 | Linux | 36021
130 | Linux内核 (Linux kernel) | 9114
132 | 黑客 (Hacker) | 11454
133 | 林纳斯·托瓦兹 (Linus Torvalds) | 8700
135 | 理查德·马修·斯托曼 (Richard Matthew Stallman) | 17952
136 | 自由软件基金会 (Free Software Foundation) | 12138
137 | 2003年7月 (July 2003) | 77645
140 | 操作系统 (Operating system) | 21763
141 | 操作系统列表 (List of operating systems) | 17166
142 | GNU/Linux | 7686
147 | 中国历史 (History of China) | 100657
149 | GNU | 9399
153 | 自由软件运动 (Free software movement) | 3439
157 | 材料科学 (Materials science) | 9817
 
Re: Wikipedia XML Corpus (free access and download)

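A great resource. In fact, http://www-connex.lip6.fr/~denoyer/wikipediaXML/cgi-bin/listAllDocuments.php?lang=zh does not list the whole corpus. The largest ID on that page is "157", but if you keep incrementing the id, for example to "280", you get "历史年表" (Historical chronology): http://www-connex.lip6.fr/~denoyer/wikipediaXML/cgi-bin/getXML.php?lang=zh&id=280
It looks like this corpus holds more material that still needs to be dug out. But first, thanks to laohong.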
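
For anyone who wants to try other ids, here is a minimal Python sketch of the URL pattern quoted above. Only the getXML.php address and its lang and id parameters come from this thread; the function names (document_url, candidate_urls, fetch_document) are made up for illustration, and how the server answers for an id that does not exist is unknown, so the fetch helper simply returns None when the request fails.

```python
from urllib.error import URLError
from urllib.parse import urlencode
from urllib.request import urlopen

# Address taken from the post above; everything else is a hypothetical helper.
BASE = "http://www-connex.lip6.fr/~denoyer/wikipediaXML/cgi-bin/getXML.php"


def document_url(doc_id, lang="zh"):
    """Build the getXML.php URL for one document id."""
    return BASE + "?" + urlencode({"lang": lang, "id": doc_id})


def candidate_urls(start_id, stop_id, lang="zh"):
    """List the URLs for every id from start_id to stop_id inclusive."""
    return [document_url(i, lang) for i in range(start_id, stop_id + 1)]


def fetch_document(doc_id, lang="zh", timeout=10):
    """Return the raw response bytes, or None if the request fails.

    Assumption: a failed request means the id is unavailable. The server
    may instead return an empty or error page, which this does not detect.
    """
    try:
        with urlopen(document_url(doc_id, lang), timeout=timeout) as response:
            return response.read()
    except (URLError, OSError):
        return None
```

Calling document_url(280) rebuilds the address given above. When probing a range of ids it would be polite to pause between requests.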
 
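Two questions come to mind. First, how authentic is this material: was it written by native speakers, or by learners and the like? Second, although the material is in XML format, apart from paragraph markers and similar tags, much of the other markup amounts to noise. How should it be handled?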
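
On the second question, one possible approach is to keep only the text inside paragraph elements and drop all other markup. The sketch below is a guess at how that could look in Python. The tag names (p, emph, link, figure) and the sample document are assumptions made up for illustration, not the corpus's actual schema, so the paragraph_tag argument would need to be adjusted after looking at a real file.

```python
import xml.etree.ElementTree as ET


def paragraph_texts(xml_string, paragraph_tag="p"):
    """Return the plain text of each paragraph element, with inline tags removed.

    paragraph_tag is an assumed tag name; check it against a real document.
    """
    root = ET.fromstring(xml_string)
    texts = []
    for para in root.iter(paragraph_tag):
        # itertext() yields the text of the element and of all nested elements,
        # so inline markup such as links or emphasis disappears but its text stays.
        text = "".join(para.itertext())
        # Collapse runs of whitespace left behind by the removed markup.
        text = " ".join(text.split())
        if text:
            texts.append(text)
    return texts


# Hypothetical sample, not taken from the corpus.
sample = (
    "<article><name>demo</name><body>"
    "<p>Plain <emph>text</emph> with a <link target='x'>link</link>.</p>"
    "<figure>skipped</figure>"
    "<p> Second  paragraph. </p>"
    "</body></article>"
)

print(paragraph_texts(sample))
# prints ['Plain text with a link.', 'Second paragraph.']
```

Anything outside a paragraph element, such as the figure in the sample, is ignored. Whether that loses useful content depends on how the real documents are structured.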
 
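Indeed. Also, many of the article titles listed above look like they follow Hong Kong naming conventions for Chinese. I wonder whether the material they collected also covers Chinese as it is used on the mainland. Thanks, laohong!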
 