What is a corpus?

xujiajin · 2005-06-20

The word "corpus", derived from the Latin word for "body", may be used to refer to any text in written or spoken form. In modern linguistics, however, the term refers to a large collection of texts, held in machine-readable form, that represents a sample of a particular variety or use of a language or languages. Other definitions, broader or stricter, exist. See, for example, the definition in the book "Corpus Linguistics" by Tony McEnery and Andrew Wilson, or read more about different kinds of corpora in the Systematic Dictionary of Corpus Linguistics.

Computer-readable corpora can consist of raw text only, i.e. plain text with no additional information. Many corpora have been provided with some kind of linguistic information, here called mark-up or annotation.
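
To make the difference between raw and annotated text concrete, here is a minimal sketch in Python. The `word_TAG` format, the tag names, and the helper functions `parse_annotated` and `strip_annotation` are assumptions made up for illustration; real corpora use many different annotation schemes.

```python
# Raw corpus text: plain words, no additional information.
raw_text = "The cat sat on the mat"

# Hypothetical annotated version: each token carries an illustrative
# part-of-speech tag after an underscore.
annotated_text = "The_DET cat_NOUN sat_VERB on_ADP the_DET mat_NOUN"


def parse_annotated(text):
    """Split 'word_TAG' tokens into (word, tag) pairs."""
    pairs = []
    for token in text.split():
        # rpartition splits on the LAST underscore, so a word that
        # itself contains an underscore is still handled.
        word, _, tag = token.rpartition("_")
        pairs.append((word, tag))
    return pairs


def strip_annotation(text):
    """Recover the raw text by discarding the tags."""
    return " ".join(word for word, _ in parse_annotated(text))


pairs = parse_annotated(annotated_text)
nouns = [word for word, tag in pairs if tag == "NOUN"]
print(nouns)
print(strip_annotation(annotated_text) == raw_text)
```

The annotation adds information that the raw text does not carry (here, word class), which is what makes queries such as "list all nouns" possible without re-analysing the text.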

xiaoz · 2005-06-20

Corpus markup and annotation have sometimes been used interchangeably in the corpus literature, though I choose to maintain a distinction between the two.

hancunxin · 2005-06-30

Could you recommend a few must-read introductory books?

xujiajin · 2005-06-30

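A good friend who knows that I am very interested in corpus linguistics asked me which good introductory works on corpus linguistics, in the original English, are available in China. Although I have read several books on the subject, I could not give an answer on the spot, so I promised to get back to him later. That is how this post came about. I hope it answers my friend's question and also serves as a reference for researchers who want to learn about corpus linguistics.
So far, Shanghai Foreign Language Education Press and Foreign Language Teaching and Research Press have between them introduced four original English monographs and edited collections on corpus linguistics to China. They are 1) John Sinclair's Corpus, Concordance, Collocation; 2) Graeme Kennedy's An Introduction to Corpus Linguistics; 3) Douglas Biber, Susan Conrad, and Randi Reppen's Corpus Linguistics; and 4) the collection Using Corpora for Language Research, edited by Jenny Thomas and Mick Short. One could also count the monumental Longman Grammar of Spoken and Written English compiled by Biber et al., but strictly speaking it is a product built with a corpus rather than a theoretical work on corpus linguistics.
Book 1) appeared relatively early (1991) and mainly uses concordancing, based on the COBUILD project, to study collocation in English. Its scope is fairly narrow, so it is not the focus of this post. Book 4) is an edited collection and is not the focus either. This post concentrates on 2), 3), and English Corpus Linguistics: An Introduction by Charles Meyer, published by Cambridge University Press in 2002, which I have only just got hold of.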
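
For readers new to the term, concordancing means listing every occurrence of a word together with its immediate context, often called a KWIC (keyword in context) display. The following is only a minimal sketch in Python under simple assumptions: the function name `kwic`, the whitespace tokenisation, and the sample sentence are all made up for illustration and are not taken from Sinclair's book or from COBUILD.

```python
def kwic(tokens, keyword, window=3):
    """Return (left context, keyword, right context) for each hit."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append((left, tok, right))
    return lines


# Illustrative sample text, split on whitespace only.
text = "strong tea is better than weak tea but strong coffee is best"
tokens = text.split()

hits = kwic(tokens, "strong", window=2)
for left, key, right in hits:
    # Right-align the left context so the keywords line up in a column.
    print(f"{left:>12}  [{key}]  {right}")
```

Reading down the column to the right of the keyword shows which words tend to follow it ("tea", "coffee"), which is the basic idea behind studying collocation with a concordance.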

xujiajin · 2005-06-30

Others include:
Corpus Linguistics at Work
Lexis in Contrast
Small Corpus
Corpora in Applied Linguistics
Tony McEnery. 1996. Corpus Linguistics.
Stubbs, Michael. 1996. Text and Corpus Analysis
International Journal of Corpus Linguistics
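There are also some people who work mainly in computational linguistics...
Register variation has recently been a hot topic in corpus linguistics.
Two must-read books, from 1992 and 1991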
Corpus annotation
Learner corpus on Computer

hancunxin · 2005-07-01

Many thanks to xujiajin for the detailed guidance.

hancunxin · 2005-07-01

Would you please also make a systematic recommendation of Chinese-language books on corpus linguistics?

hancunxin · 2005-07-01

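1) John Sinclair's Corpus, Concordance, Collocation; 2) Graeme Kennedy's An Introduction to Corpus Linguistics; 3) Douglas Biber, Susan Conrad, and Randi Reppen's Corpus Linguistics; and 4) the collection Using Corpora for Language Research, edited by Jenny Thomas and Mick Short. I have seen these four books at the Guanggu Book City in Wuhan, but it does not seem to have the others. Friends in Wuhan, if you have any information about where to find the books, please do share it!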

xujiajin · 2005-07-01

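Re: What is a corpus?

Quoting hancunxin's post of 2005-7-1 17:18:02:
1) John Sinclair's Corpus, Concordance, Collocation; 2) Graeme Kennedy's An Introduction to Corpus Linguistics; 3) Douglas Biber, Susan Conrad, and Randi Reppen's Corpus Linguistics; and 4) the collection Using Corpora for Language Research, edited by Jenny Thomas and Mick Short. I have seen these four books at the Guanggu Book City in Wuhan, but it does not seem to have the others. Friends in Wuhan, if you have any information about where to find the books, please do share it!

Those four are the ones that have been introduced to China, so most foreign-language bookshops will stock them. For the other books, you will have to look in the major libraries.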

xujiajin · 2005-07-01

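There are indeed a few books on corpus linguistics written in Chinese. Of course, if you can read English, I suggest you read the English ones first, because once you have read them you will find that a large share of what the Chinese books say has been taken over from the English works.

He Anping, 语料库语言学与英语教学 (Corpus Linguistics and English Teaching), 2004, Foreign Language Teaching and Research Press
Yang Huizhong (chief editor), Wei Naixing et al. (authors), 语料库语言学导论 (An Introduction to Corpus Linguistics), 2002, Shanghai Foreign Language Education Press
Huang Changning and Li Juanzi, 语料库语言学 (Corpus Linguistics), 2002, published by the Commercial Press, I think. It is a very thin booklet. It leans more towards the technical side, because the authors are computational linguistics researchers in Beijing with an engineering background.

Those are the only three I have come across.

They are still worth a look. This book was written mainly by several of Professor Yang Huizhong's students.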

hancunxin · 2005-07-04

The situation is not encouraging.

hancunxin · 2005-07-04

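Our teacher says the original English works on corpora are fairly hard to follow, so it is better to start with a Chinese book and get a taste of the subject first. I have only just started reading Professor Yang Huizhong's book. However, on page 71 of that book there is a link to a very useful tagging program, and I have found that the link is dead.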

hancunxin · 2005-07-04

That is, Li Wenzhong's personal web page. Please check it for yourselves: http://grwy.online.ha.cn/liwenzhong

xujiajin · 2005-07-04

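Re: What is a corpus?

Quoting hancunxin's post of 2005-7-4 12:51:09:
Our teacher says the original English works on corpora are fairly hard to follow, so it is better to start with a Chinese book and get a taste of the subject first. I have only just started reading Professor Yang Huizhong's book. However, on page 71 of that book there is a link to a very useful tagging program, and I have found that the link is dead.

Since you are in a foreign languages department, I really would suggest that you not read the Chinese books. Introductions to corpus linguistics generally do not involve much technical content, so they are fairly easy to follow.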

Victoria Thompson · 2014-09-04

Re: What is a corpus?

A linguistic corpus is a large set of authentic examples of language use. The examples may be in text or audio form.
