WEBCORP: THE WEB AS CORPUS

xujiajin

Administrator
Staff member
http://www.webcorp.org.uk/

What is WebCorp?
However large and up-to-date the electronic text corpora available are, there will always be aspects of the language which are too rare or too new to be evidenced in them. WebCorp is a suite of tools which allows access to the World Wide Web as a corpus - a large collection of texts from which facts about the language can be extracted.

Who can use WebCorp?
WebCorp can be used by anyone who has an interest in language and how particular words and phrases are used, especially words and phrases which are too new or too rare to appear in any dictionary or standard corpus. Since its launch, WebCorp has been used by corpus linguists, lexicographers, language teachers and learners, publishers, journalists, advertisers, and researchers in a variety of fields. Although WebCorp is designed for linguistic data search, many users have found its results format (with relevant sections of text from multiple web pages collated on one page) useful for information retrieval of the type for which standard search engines are usually used.

*************
[Discussion] Google As a Corpus Tool
http://www.corpus4u.org/showthread.php?t=94

Google as a Quick and Dirty Corpus Tool
http://www.corpus4u.org/showthread.php?t=323

A reworked form page for Google searches
http://www.corpus4u.org/showthread.php?t=508

PPTs for Web-as-A-Corpus workshop
http://www.corpus4u.org/showthread.php?t=560
 
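I have an idea for applying statistical methods to the web-as-a-corpus approach. I would like to discuss it here and ask for your advice.
Use WebCorp as the search engine. WebCorp has a very interesting CGI that converts the HTML it retrieves into text. Its URL is: http://www.webcorp.org.uk:80/cgi-bin/webparse.nm? It can be called from a client machine by appending the parameter urlstring, for example: http://www.webcorp.org.uk:80/cgi-bi...g=http://subscribe.free.fr/pperso/ungiga.html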
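As a minimal sketch of how such a request URL could be assembled, the Python snippet below joins the CGI address and the urlstring parameter named in the post. The function name plain_text_url is hypothetical, and the choice to percent-encode everything except ":" and "/" is an assumption; the post only shows an unencoded example, and the service may no longer respond at this address.

```python
from urllib.parse import urlencode

# CGI address quoted in the post; not verified to be reachable today.
WEBPARSE = "http://www.webcorp.org.uk:80/cgi-bin/webparse.nm"


def plain_text_url(page_url: str) -> str:
    """Build the webparse URL asking for a text version of page_url.

    plain_text_url is an illustrative name, not part of WebCorp.
    """
    # safe=":/" keeps the target URL readable, as in the post's example;
    # other reserved characters are percent-encoded (our assumption).
    return WEBPARSE + "?" + urlencode({"urlstring": page_url}, safe=":/")


if __name__ == "__main__":
    print(plain_text_url("http://subscribe.free.fr/pperso/ungiga.html"))
```

Keeping the URL construction in one small function makes it easy to generate a list of such addresses for a batch downloader.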

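Download all the files that WebCorp has retrieved and converted to text, filter them, treat the result as a sample population, and run the statistics on that basis. For example, to study the collocation of two words, search for documents in which the two node words stand in an "AND" relation (that is, documents containing both words), then for each document compute the mean, variance and similar figures for the distance between the two words, to show how concentrated or dispersed they are. I think this statistical approach is fairly sound, but would measures such as mutual information or the t-test also be appropriate? I would welcome advice from the corpus statistics experts here.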
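The per-document distance statistics described above could be sketched roughly as follows in Python. The post does not say how "distance" is defined when a word occurs several times, so this sketch assumes the distance from each occurrence of the first word to the nearest occurrence of the second, counted in tokens. The function names and the simple \w+ tokenizer are illustrative assumptions as well.

```python
import re
from statistics import mean, pvariance


def tokenize(text):
    """Lower-case the text and split it into word tokens (a crude tokenizer)."""
    return re.findall(r"\w+", text.lower())


def nearest_distances(tokens, word_a, word_b):
    """For each occurrence of word_a, return the distance in tokens
    to the nearest occurrence of word_b (assumed definition)."""
    pos_a = [i for i, t in enumerate(tokens) if t == word_a]
    pos_b = [i for i, t in enumerate(tokens) if t == word_b]
    if not pos_a or not pos_b:
        return []
    return [min(abs(a - b) for b in pos_b) for a in pos_a]


def distance_stats(text, word_a, word_b):
    """Return count, mean and population variance of the distances,
    or None if the document does not contain both words."""
    d = nearest_distances(tokenize(text), word_a.lower(), word_b.lower())
    if not d:
        return None
    return {"n": len(d), "mean": mean(d), "variance": pvariance(d)}


if __name__ == "__main__":
    sample = "Strong tea is nice but strong black tea is better"
    print(distance_stats(sample, "strong", "tea"))
```

A small mean with a small variance would suggest that the two words tend to sit close together in a fixed pattern; a large variance suggests a looser association.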

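To make downloading easier, I wrote a small tool that extracts the URLs of all the "plain text" links on a results page. Copy the URLs into a word processor to build a URL list and import it into a batch downloader such as WinHTTrack. After downloading, use an analysis tool that copes well with many or large files, such as MonoConc Pro or WordSmith, to sample further, count word frequencies and collect data. Further statistics can then be run on the results.
The tool is actually a dynamic HTML page. The source is as follows: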
<html>

<head>
<meta http-equiv="Content-Type" content="text/html; charset=gb2312">
<title>WebCorp_Plain_Text_URL_Extractor</title>

<script LANGUAGE="VBSCRIPT">

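' Parse an HTML string into a detached document (IE "htmlfile" COM object).
' Returns Nothing if the object cannot be created.
Function TXT2HTMLDOC(ByVal Data)
    Dim OBJ
    Set OBJ = Nothing
    On Error Resume Next
    Set OBJ = CreateObject("htmlfile")
    OBJ.open()
    OBJ.write(Data)
    OBJ.close()
    Set TXT2HTMLDOC = OBJ
End Function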

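' Called by the EXTRACT button with the pasted HTML source of a WebCorp results page.
Sub Button1_Click(ByVal DATA)
    Dim K, J, EL, AL
    Set K = TXT2HTMLDOC(DATA)
    If Not K Is Nothing Then
        Set J = K.body.all
        If Not J Is Nothing Then
            For Each EL In J
                ' Keep only anchors whose visible label is exactly "Plain Text".
                If EL.tagName = "A" Then
                    If Trim(EL.innerText) = "Plain Text" Then
                        ' Point the link at WebCorp's HTML-to-text CGI; the query string is kept.
                        Set AL = EL
                        AL.protocol = "http:"
                        AL.hostname = "www.webcorp.org.uk"
                        AL.pathname = "/cgi-bin/webparse.nm"
                        ' Note: document.write replaces the current page with the URL list.
                        document.write(AL.href & "<BR>")
                    End If
                End If
            Next
        End If
    End If
End Sub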
</script>
</head>

<body>
<font face="Verdana"><b>PASTE THE HTML SOURCE OF WEBCORP RESULTS IN THIS TEXTAREA, THEN CLICK "EXTRACT"</b></font><p>&nbsp;<br>
<textarea rows="17" name="S1" cols="83"></textarea>
<input type="button" value="EXTRACT" name="BUTTON1" onclick="BUTTON1_CLICK(S1.innerText)">
</p>
</body>

</html>
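The page above relies on VBScript and the "htmlfile" object, so it only runs in Internet Explorer on Windows. For readers without that environment, the same extraction step might be approximated in Python as in the sketch below. It is not the author's tool: the class and function names are hypothetical, and it assumes, as the script above does, that the relevant links are labelled exactly "Plain Text" and carry the needed query string.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit, urlunsplit


class PlainTextLinks(HTMLParser):
    """Collect the links labelled "Plain Text" from a results page."""

    def __init__(self):
        super().__init__()
        self.urls = []
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            if "".join(self._text).strip() == "Plain Text":
                parts = urlsplit(self._href)
                # Same rewrite as the VBScript: fix scheme, host and path,
                # keep the original query string.
                self.urls.append(urlunsplit((
                    "http", "www.webcorp.org.uk", "/cgi-bin/webparse.nm",
                    parts.query, parts.fragment)))
            self._href = None


def extract_plain_text_urls(html):
    """Return the rewritten "Plain Text" URLs found in an HTML string."""
    parser = PlainTextLinks()
    parser.feed(html)
    parser.close()
    return parser.urls
```

The returned list can be written to a file, one URL per line, and handed to a batch downloader in the same way as the output of the HTML page.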
 
Not exactly. The tool I made is meant to be used with just WebCorp. The tool only extracts all URLs linked to text-only pages generated by WebCorp text converter (Html to text), then those text-only pages are downloaded and serve as a corpus sample. On this basis, some statistical analysis can be performed.
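Once the text-only pages have been downloaded, a first statistical pass such as a word-frequency count could look like the following Python sketch. The folder layout (one .txt file per page), the UTF-8 encoding and the simple \w+ tokenizer are all assumptions made for illustration; a concordancer such as those named earlier in the thread would do this with more care.

```python
import re
from collections import Counter
from pathlib import Path


def count_words(texts):
    """Count lower-cased word tokens over an iterable of strings."""
    counts = Counter()
    for text in texts:
        counts.update(re.findall(r"\w+", text.lower()))
    return counts


def corpus_frequencies(folder, pattern="*.txt"):
    """Count words over every file in folder that matches pattern.

    Assumes the downloaded pages were saved as UTF-8 text files;
    undecodable bytes are replaced rather than raising an error.
    """
    paths = sorted(Path(folder).glob(pattern))
    return count_words(
        p.read_text(encoding="utf-8", errors="replace") for p in paths
    )


if __name__ == "__main__":
    # "downloads" is a placeholder folder name.
    for word, freq in corpus_frequencies("downloads").most_common(20):
        print(word, freq)
```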
 
Re: WEBCORP: THE WEB AS CORPUS

Hi, Colleagues

If you want to understand how www.corbalex.com works, read the following explanation.

HTA applications can only be run on Microsoft Windows. They are rather like extended web pages. They run without the same security limitations as .html or .htm files; for instance, you can use ActiveX objects to read and write files and folders on a user's computer without a security warning.

Essentially, an HTA application is like an executable programmed in HTML and script, offering the ability to use the various Windows scripting technologies as well as all the features of Internet Explorer.

CorBaLEx is, perhaps, the only computational tool for extracting collocations from web pages online.

Further details can be obtained from: http://msdn.microsoft.com/library/d...htaoverview.asp

Hope this helps

Lebron Letchev

Quoting xiaoz (2005-8-20 0:16:26):
dzhigner, are you doing something like this - how to extract collocations from webpages?

http://www.corbalex.com/
 
Re: WEBCORP: THE WEB AS CORPUS

Would it be desirable to have some of these general functions built in:
- Back (after user reads one of the links) and
- Print/Save (to save the page that user is interested in)?

Quoting dzhigner (2005-8-17 11:48:11):
GOOGLE SEARCH FORM REVISED
I have improved the Google search form I made earlier and added Google Scholar and Google Print.
http://www.corpus4u.org/upload/forum/2005082214363671.rar
 
Re: WEBCORP: THE WEB AS CORPUS

Quoting dzhigner (2005-8-21 1:12:05):
Not exactly. The tool I made is meant to be used with just WebCorp. The tool only extracts all URLs linked to text-only pages generated by WebCorp text converter (Html to text), then those text-only pages are downloaded and serve as a corpus sample. On this basis, some statistical analysis can be performed.

How do you do statistical analysis based on the search results such as this:
2005082701381648.jpg


I have incorporated this tool into the next version of ACWT, but I don't know how
different it is from a similar Google search.
 
To: 动态语法
This one is nothing but a regular Google search tool. It is not this tool that is involved with statistical analysis.
 
Re: WEBCORP: THE WEB AS CORPUS

Got it. So the script you posted above and the HTA file are two different things.
 