How to extract N-grams/clusters/chunks in CQPweb

Thread starter xujiajin
Date 2012-12-30

xujiajin

Administrator

Staff member

2012-12-30

#1

To extract bigrams, set Query mode to [CQP syntax] and search for [word=".*"] [word=".*"], or equivalently [word=".*"]{2}. Three-word and four-word n-grams follow the same pattern.
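The same pattern can be generated for any n. Here is a minimal Python sketch; the helper name `ngram_query` is hypothetical and not part of CQPweb, and the resulting string still has to be pasted into the query box with Query mode set to [CQP syntax].

```python
def ngram_query(n: int, token_pattern: str = ".*", expanded: bool = False) -> str:
    """Build a CQP query string matching n consecutive tokens.

    Hypothetical helper for illustration only; it just assembles the
    query text shown in the post above.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    token = f'[word="{token_pattern}"]'
    if expanded:
        # Long form: one token expression per position, separated by spaces.
        return " ".join([token] * n)
    # Short form: a single token expression with a repetition count.
    return f"{token}{{{n}}}"


bigram = ngram_query(2)
bigram_long = ngram_query(2, expanded=True)
trigram = ngram_query(3)
```

Both forms describe the same sequence of tokens; the short form is simply easier to type for larger n.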

BFSU CQPweb: the BFSU CQPweb online corpus query system is now live
http://111.200.194.212/cqp/

The NESSIE corpus, an alternative to the LOCNESS corpus
http://111.200.194.212/cqp/

Last edited: 2016-05-02

snow623

2012-12-30

#2

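Re: How to extract N-grams/clusters/chunks in CQPweb

In CQPweb I would like to look at how verbs are used. How can I retrieve the frequencies of all verbs? I read the simple query syntax documentation, where one example searches for reflexive pronouns with _PNX, but when I tried it no reflexive pronouns were found. For verbs I tried \S+_V\w+\s, _VERB and _{VERB}, and none of them returned anything. How should I search? CQPweb queries seem to differ from the regular expressions I learned before, which do not work here!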

xujiajin

Administrator

Staff member

2012-12-30

#3

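Re: How to extract N-grams/clusters/chunks in CQPweb

http://www.bfsu-corpus.org/static/corpus_tools/CQPweb_guide.pdf
The PDF at this link ends with examples of advanced queries. CQP syntax is not the same as ordinary regular expressions, but almost anything regular expressions can do, CQP syntax can do as well.

To search for all verbs, set Query mode to [CQP syntax] and enter [pos="V.*"]. If anything, CQP syntax is somewhat easier to understand than ordinary regex.
For nouns, for example, [pos="N.*"] is all you need.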
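To see what a query such as [pos="V.*"] selects, here is a rough Python analogue. It assumes the corpus is available as a list of (word, tag) pairs, which is not how CQPweb stores data; the sample sentence and its C7-style tags are made up for illustration. The regular expression has to match the whole tag, which is how CQP treats attribute patterns.

```python
import re


def match_pos(tagged_tokens, pos_pattern):
    """Return the (word, tag) pairs whose tag matches pos_pattern in full.

    Illustrative stand-in for a CQP query like [pos="V.*"]; not CQPweb code.
    """
    rx = re.compile(pos_pattern)
    return [(word, tag) for word, tag in tagged_tokens if rx.fullmatch(tag)]


# Made-up sample sentence with C7-style tags (an assumption for this sketch).
sample = [
    ("She", "PPHS1"),
    ("has", "VHZ"),
    ("written", "VVN"),
    ("two", "MC"),
    ("books", "NN2"),
    (".", "."),
]

verbs = match_pos(sample, "V.*")
nouns = match_pos(sample, "N.*")
```

Here `verbs` holds the tokens tagged VHZ and VVN, and `nouns` holds the token tagged NN2.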

Last edited: 2016-05-02

xujiajin

Administrator

Staff member

2012-12-30

#4

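Re: How to extract N-grams/clusters/chunks in CQPweb

snow623 wrote:
In CQPweb I would like to look at how verbs are used. How can I retrieve the frequencies of all verbs? I read the simple query syntax documentation, where one example searches for reflexive pronouns with _PNX, but when I tried it no reflexive pronouns were found. For verbs I tried \S+_V\w+\s, _VERB and _{VERB}, and none of them returned anything. How should I search? CQPweb queries seem to differ from the regular expressions I learned before, which do not work here!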

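The example was presumably written for a corpus tagged with CLAWS5 (the C5 tagset), where the code for reflexive pronouns is PNX.

The corpora on our BFSU CQPweb are all tagged with C7, so to search for reflexive pronouns you should use [pos="PPX.*"]
http://ucrel.lancs.ac.uk/claws7tags.html

Once the reflexive pronouns have been retrieved, you can also try the Frequency breakdown function.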
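As a rough idea of what a frequency breakdown of the [pos="PPX.*"] hits gives you, here is a small Python sketch. It only imitates the idea on a made-up list of (word, tag) pairs; it does not call CQPweb, and the tags in the sample are assumptions chosen to resemble C7.

```python
import re
from collections import Counter

# Made-up tagged tokens (assumed C7-style tags, for illustration only).
tagged = [
    ("himself", "PPX1"),
    ("themselves", "PPX2"),
    ("Himself", "PPX1"),
    ("oneself", "PNX1"),
    ("them", "PPHO2"),
]

# Keep tokens whose whole tag matches PPX.*, as in [pos="PPX.*"].
reflexive = re.compile(r"PPX.*")
hits = [word.lower() for word, tag in tagged if reflexive.fullmatch(tag)]

# Count each word form among the hits, most frequent first.
breakdown = Counter(hits)
ranked = breakdown.most_common()
```

Lowercasing before counting is a choice made in this sketch so that "Himself" and "himself" are counted together.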

xiaoz

Permanent super administrator

Staff member

2013-01-08

#5

Re: How to extract N-grams/clusters/chunks in CQPweb

To extract 3-word clusters excluding punctuation marks, try the following pattern:

[word="\w*"]{3}
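For comparison, the effect of this pattern can be imitated offline on a plain token list. The Python sketch below is only an approximation under stated assumptions: it slides an overlapping three-token window over a made-up sentence and keeps windows in which every token consists entirely of word characters. Python's `\w` may differ in detail from the regex engine behind CQP.

```python
import re
from collections import Counter

WORD_ONLY = re.compile(r"\w*")  # same pattern as in the query above


def clusters(tokens, n=3):
    """Count n-token windows in which every token fully matches \\w*."""
    counts = Counter()
    for i in range(len(tokens) - n + 1):
        window = tokens[i:i + n]
        # A punctuation token such as "," fails the full match and
        # disqualifies every window that contains it.
        if all(WORD_ONLY.fullmatch(tok) for tok in window):
            counts[" ".join(window)] += 1
    return counts


# Made-up example sentence, already tokenised.
tokens = ["in", "the", "end", ",", "in", "the", "end", "it", "worked", "."]
result = clusters(tokens)
```

Tokens containing a hyphen or an apostrophe also fail the `\w*` test, so clusters that include such tokens are dropped along with the punctuation.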

xujiajin

Administrator

Staff member

2013-01-09

#6

Re: How to extract N-grams/clusters/chunks in CQPweb

Re: xiaoz
Hi Richard, in your review of McEnery and Hardie's book, you pointed out that CQPweb is not helpful for cluster/chunk and MD analysis.

In my view, since we can use the regex-like CQP syntax, cluster extraction and MD analysis with CQPweb should not be a problem.

xiaoz

Permanent super administrator

Staff member

2013-01-09

#7

Re: How to extract N-grams/clusters/chunks in CQPweb

Yes, CQP makes it possible to analyse word clusters in supported corpora. What I said in my review referred to ordinary users' own corpora.

But probably even this is changing, as some online systems have started to allow users to upload their own corpora for analysis. For example, Leeds University's Intellitext Corpus Queries system allows users to build their own corpora. The system supports a number of languages including Chinese, and it even includes Biber-style MDA (called Multivariate analysis), though the results are not always easy to interpret.

http://smlc09.leeds.ac.uk/itweb/htdocs/Query.html

xujiajin wrote:
Re: xiaoz
Hi Richard, in your review of McEnery and Hardie's book, you pointed out that CQPweb is not helpful for cluster/chunk and MD analysis.

In my view, since we can use the regex-like CQP syntax, cluster extraction and MD analysis with CQPweb should not be a problem.

xujiajin

Administrator

Staff member

2013-01-09

#8

Re: How to extract N-grams/clusters/chunks in CQPweb

Thanks for the reply and for sharing the Leeds query system.
