Orthographic Errors in Web Pages - Towards Cleaner Web Corpora

laohong · 2006-12-08

Christoph Ringlstetter, Klaus U. Schulz and Stoyan Mihov. 2006. Orthographic Errors in Web Pages - Towards Cleaner Web Corpora. Computational Linguistics, September 2006, Vol. 32(3), pp. 295-340.

Abstract:

Since the Web by far represents the largest public repository of natural language texts, recent experiments, methods, and tools in the area of corpus linguistics often use the Web as a corpus. For applications where high accuracy is crucial, the problem has to be faced that a non-negligible number of orthographic and grammatical errors occur in Web documents. In this article we investigate the distribution of orthographic errors of various types in Web pages. As a by-product, methods are developed for efficiently detecting erroneous pages and for marking orthographic errors in acceptable Web documents, reducing thus the number of errors in corpora and linguistic knowledge bases automatically retrieved from the Web.

Full paper for downloading...
(Unfortunately, the attachment failed to upload several times. Please download it from the Corpus4U Gmail account!)

armstrong · 2006-12-08

Re: Orthographic Errors in Web Pages - Towards Cleaner Web Corpora

Thanks, Dr. Hong.

wangdw · 2006-12-10

Re: Orthographic Errors in Web Pages - Towards Cleaner Web Corpora

Thanks, I look forward to reading the full paper.

oscar3 · 2006-12-10

Re: Orthographic Errors in Web Pages - Towards Cleaner Web Corpora

Thanks again for the copy.

wangdw · 2006-12-10

Re: Orthographic Errors in Web Pages - Towards Cleaner Web Corpora

Originally posted by laohong:
Christoph Ringlstetter, Klaus U. Schulz and Stoyan Mihov. 2006. Orthographic Errors in Web Pages - Towards Cleaner Web Corpora. Computational Linguistics, September 2006, Vol. 32(3), pp. 295-340.

Full paper for downloading...
(Pls download it from Corpus4U gmail account!)

Could you please let me know where the Corpus4U Gmail account is?
I am longing for it.
Best wishes.

wangdw · 2006-12-19

Re: Orthographic Errors in Web Pages - Towards Cleaner Web Corpora

Thanks, laohong!

清风出袖 · 2006-12-19

Re: Orthographic Errors in Web Pages - Towards Cleaner Web Corpora

our gmail account: corpus4u@gmail.com
pw: www.corpus4u.com
