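Question: detailed steps for building a small corpus

Re: Question: detailed steps for building a small corpus

I'm a beginner and would like to build a small corpus of my own.
Could you tell me the detailed steps for building one?
Thanks!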

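At the simplest, you can get by with nothing more than some texts and a search tool. Still, I'd suggest reading a book on the subject, where you will find the "detailed steps"; anything on corpus construction will help you.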
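
To make "texts plus a search tool" concrete, here is a minimal sketch of a keyword-in-context (KWIC) search in Python. The function name `kwic`, the sample sentence and the context width are assumptions made up for illustration; this is not how any particular concordancer is implemented, and the word-boundary matching only suits space-delimited languages such as English.

```python
import re

def kwic(text, keyword, width=30):
    """Return one keyword-in-context line per whole-word match of keyword."""
    lines = []
    # \b word boundaries work for space-delimited languages, not for Chinese.
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    for m in pattern.finditer(text):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        # Right-align the left context so the keywords line up in a column.
        lines.append(f"{left:>{width}} [{m.group()}] {right:<{width}}")
    return lines

# Hypothetical sample text, standing in for a real corpus file.
sample = "A corpus is a collection of texts. A small corpus can still be useful."
for line in kwic(sample, "corpus"):
    print(line)
```

For a real corpus you would read the text from your own files instead of the sample string, and a dedicated concordancer will be far more convenient than this.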
 
Re: Question: detailed steps for building a small corpus

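Chapter 2: Text Collection and Processing
2.1 Text collection
2.1.1 Creating your own corpus
2.1.2 Using existing corpora
2.2 Text cleanup
2.2.1 Clean texts and problem texts
2.2.2 Cleaning up a single text
2.2.3 Batch cleanup of multiple texts
2.2.4 Summary
2.3 Metadata annotation
2.3.1 Components of metadata
2.3.2 Markup language
2.3.3 Summary
2.4 Tokenization, lemmatization and POS tagging
2.4.1 Tokenization
2.4.2 Lemmatization
2.4.3 POS tagging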

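Those are the rough steps.
The outline above is the table of contents of the relevant chapter of our book 《语料库应用教程》, given here for your reference. The book is expected to come out this summer.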
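
As a rough illustration of text cleanup, metadata annotation and tokenization, here is a minimal Python sketch using only the standard library. It is not taken from the book: the function names, the XML-style `<text>` header and the metadata fields `id` and `genre` are all assumptions for illustration. Lemmatization and POS tagging are left out because they normally need a dedicated tagger.

```python
import re
import unicodedata
from xml.sax.saxutils import escape, quoteattr

def clean_text(raw):
    """Normalize a raw text: Unicode form, line endings and whitespace."""
    text = unicodedata.normalize("NFKC", raw)      # e.g. full-width to half-width
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    text = re.sub(r"[ \t]+", " ", text)            # collapse runs of spaces/tabs
    text = re.sub(r"\n{3,}", "\n\n", text)         # keep at most one blank line
    return text.strip()

def add_header(text, **meta):
    """Wrap the text in a hypothetical XML-style element carrying metadata."""
    attrs = " ".join(f"{k}={quoteattr(str(v))}" for k, v in sorted(meta.items()))
    return f"<text {attrs}>\n{escape(text)}\n</text>"

def tokenize(text):
    """Very rough tokenizer for space-delimited languages such as English."""
    # Words (optionally with an internal apostrophe) or single punctuation marks.
    return re.findall(r"\w+(?:'\w+)?|[^\w\s]", text)

# Hypothetical example: one messy text, cleaned, tokenized and annotated.
raw = "Don't  panic,\r\n\r\n\r\n\r\nit's fine. "
cleaned = clean_text(raw)
print(tokenize(cleaned))
print(add_header(cleaned, id="t001", genre="essay"))
```

Chinese text would need a proper word segmenter in place of `tokenize`, since words are not separated by spaces.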
 
Re: Question: detailed steps for building a small corpus

Good news for all corpus practitioners. Will you share in this book the design decisions behind all the corpus tools you have made so far? Besides, is WilliamJia a co-author? There is much to learn from him regarding programming.
 
Re: Question: detailed steps for building a small corpus

William is not with us in this book. He is a much valued friend and partner in carrying on our corpus ideas.
The forthcoming book is co-authored by 梁茂成, 李文中 and 许家金.

This book is not organized in the line of corpus tools, which should not be the main concern for corpus studies in the first place. I told my students, quite often, to forget about corpus tools and the technology part before they actually get into a corpus linguistic analysis. Corpus analyses should be led by a linguist, not a computer scientist. Nonetheless, it is certainly important to know what is possible technologywise on the part of the main investigator. Linguistics is at the heart of language corpus studies, and corpus technology is but the implementation or solutions, and most often the partial solutions.


BTW: What are "design decisions"? Did you mean the organization of the book?
 
Re: Question: detailed steps for building a small corpus

This book is not organized in the line of corpus tools
Thanks for the info.
Corpus analyses should be led by a linguist, not a computer scientist.
As a corpus is a neat tool for both linguists and computer scientists, I think I should reserve my opinion on this issue, although I am myself a linguist by trade (and training).
Nonetheless, it is certainly important to know what is possible technologywise on the part of the main investigator. Linguistics is at the heart of language corpus studies, and corpus technology is but the implementation or solutions, and most often the partial solutions.
That is true enough if we do not want to implement our ideas directly, but some of us do. So I can see nothing wrong in knowing more about the tech side, although it means we have to split our time and energy between theory and practice, at the risk that we (I) might harvest nothing in either world.
BTW: What are "design decisions"? Did you mean the organization of the book?
By "design decisions", I mean that I would like to know more about the thinking behind the design of the BFSU series of corpus tools, for example, why you chose the sentence as the basic unit of presentation in some of your programs (e.g. the FLERIC Learner Corpus Portal, Sentence Collector) instead of paragraphs or other linguistic units.
 
Re: Question: detailed steps for building a small corpus

I started my CL journey with this book. It is really a good book. Thanks for sharing.
 