[Weird crap] What would you say about such a toy?

dzhigner

Moderator
This thing, which I call "GugleExtractor", was born out of my craze for digging things up on the Internet. GugleExtractor, as the name suggests, is a tool that extracts text-only snippets from Google results, which I've assumed are not totally insufficient. I've also worked up a weird algorithm to denoise the extracted lines, simply by discarding the unclean ones. To each extracted snippet I attached a button. On click, a browser window pops up with a text-only version of the page from which the snippet is quoted. This function comes in handy when an apparently useful result doesn't have adequate co-text. As for speed, I'd dare say it gets its job done in a matter of seconds, even when all results are exhausted.

About two years ago, Google began to come in handy for me, when I realized that my English sucks and that regular corpora let me down by not providing enough clues when I write in this language. Just the other day, tired of staring at those jumbled Google snippets, I came up with the idea of GugleExtractor, and now I've got one. I'm boasting about this stuff as a trial run, and it turns out it was worth staying up to this ridiculously late hour.

So what would you say about such a toy?
2005111403534162.jpg
 
Very useful for extracting texts from the Internet to build a web-as-corpus archive.
Could you kindly provide a trial version?
 
Wouldn't it be better to have the right window display the output as concordances?

[This post was edited by the author on 2005-11-15 at 01:17:49]
 
Re: [Weird crap]

It's being debugged and will be finished within a couple of days.
I should make it clear that this tool can only extract Google snippets. It can be used to collect a sample, but only if a small context is enough.
2005111504504256.gif

This tool cuts a snippet apart at "...". Take the following snippet as an example.
2005111505020224.gif

Such a snippet is cut into 2 lines, because its parts are not continuous: they are taken from different parts of a file or a page, so the length of co-text/context is by no means guaranteed. This is why most Web-as-corpus tools need to download the pages.
A tip: adding one or two "*" to the search phrase extends the snippet.
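
The cutting at "..." described above can be sketched roughly as follows. This is only an illustration in Python, not the tool's VB.net code; the function name split_snippet and the exact separator pattern (three or more dots, or the single "…" character) are my assumptions.

```python
import re
from typing import List

# Assumption: omitted text is marked by "..." (three or more dots)
# or by the single ellipsis character.
ELLIPSIS = re.compile(r"\s*(?:\.{3,}|…)\s*")


def split_snippet(snippet: str) -> List[str]:
    """Cut a snippet into its discontinuous fragments, dropping empty ones."""
    return [part.strip() for part in ELLIPSIS.split(snippet) if part.strip()]


if __name__ == "__main__":
    # Hypothetical snippet text, made up for illustration.
    example = "the corpus was built from web pages ... results show that it is noisy ..."
    for line in split_snippet(example):
        print(line)
```

Each fragment becomes a line of its own, so nothing pretends that the two halves were ever next to each other on the page.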

I am kind of thinking about making a concordancer out of this tool.
 
GugleExtractor V.1 crafted by Ding Zheng
http://www.corpus4u.org/upload/forum/2005112103282754.rar
This tool is still very primitive. It is programmed in VB.net. To run it, .NET framework 1.1 or higher is required.
1. Use the Google Form on the left to search.
2. When the first result page is done, menu item "Extract" will be enabled.
3. Use menu item "Extract" to extract title, snippet and/or URL of each result returned by Google. Lines which don't contain any key word won't be displayed.
4. Specify the maximum number of results and the minimum length of each text line; this must be done before searching and extracting.
5. Choose the components of a result item, including title, snippet, URL and "icons", which are actually add-ons that open the original page or a text-only version of it produced by a CGI at HTTP://www.WebCorp.org.uk. This CGI only converts HTML. It won't convert PDF or anything else, for that matter.
6. The "Browser" menu only works with the browser on the left; the result browser can't go back or forward.
7. An extraction can be started only after a search has been done with the search form this tool provides. So a new extraction can be prepared in two ways: use "Reset", or use "Browser>Back" to return to the search form and do a search; the menu item "Extract" will then be enabled.
8. Although this tool is meant only to extract English text, it works with other languages, except that the simple filter function won't work with them.
I truly need suggestions, and I can't actually guarantee that it won't go wrong. If an error occurs, the tool shuts down and won't cause any trouble. By the way, if you don't like the picture of Diogenes, just delete it.
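
The two line-level rules in steps 3 and 4 (drop lines without a key word, drop lines shorter than the minimum length) could look something like this. It is a rough Python sketch rather than the VB.net code in the download; filter_lines, its parameters and the case-insensitive substring matching are all assumptions on my part.

```python
from typing import Iterable, List


def filter_lines(lines: Iterable[str], keywords: List[str],
                 min_length: int = 20) -> List[str]:
    """Keep lines that are long enough and contain at least one key word."""
    lowered = [k.lower() for k in keywords]
    kept = []
    for line in lines:
        text = line.strip()
        if len(text) < min_length:
            continue  # shorter than the minimum length
        if not any(k in text.lower() for k in lowered):
            continue  # no key word: not displayed
        kept.append(text)
    return kept


if __name__ == "__main__":
    # Hypothetical extracted lines, made up for illustration.
    sample = [
        "Home | About | Contact",
        "This corpus comes in handy for checking collocations.",
        "comes in",
    ]
    print(filter_lines(sample, ["comes in handy"], min_length=15))
```

Plain substring matching is the simplest choice here; matching on whole words would need tokenizing, which is where languages other than English start to cause trouble.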
 
The text lines are, more often than not, still kind of noisy, and they look even worse when shown as centered KWIC. So I didn't make it a KWIC concordancer. I just left the lines the way they are and, of course, kind of denoised. But the denoising part only works well (and actually not very well) with English. Anyway, it comes in handy when some hypothetical expressions or collocations need to be tested.
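
For anyone unsure what the centered KWIC layout mentioned above means, here is a minimal Python sketch. GugleExtractor does not do this; the function kwic and its width parameter are hypothetical and only show how the key word is lined up in a fixed column with the context cut off on both sides.

```python
def kwic(line: str, keyword: str, width: int = 30) -> str:
    """Center the first occurrence of keyword; return '' if it is absent."""
    pos = line.lower().find(keyword.lower())
    if pos < 0:
        return ""
    end = pos + len(keyword)
    # Keep at most `width` characters of context on each side.
    left = line[max(0, pos - width):pos].strip()
    node = line[pos:end]
    right = line[end:end + width].strip()
    # Right-align the left context so the key word starts in a fixed column.
    return f"{left:>{width}} {node} {right:<{width}}".rstrip()


if __name__ == "__main__":
    # Hypothetical lines, made up for illustration.
    for text in ["it comes in handy when testing collocations",
                 "a handy tool for digging things up"]:
        print(kwic(text, "handy", width=12))
```

The context on either side is cut at a fixed character count, often in the middle of a word, so noisy fragments end up looking even more ragged in this layout.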

This tool is a first trial. Whenever I come up with a new idea, I will update it.


[This post was edited by the author on 2005-11-22 at 03:46:57]
 