[Corrections] Some problems with transcription and annotation in CLEC

Re: [Original] Some formatting flaws in the CLEC st 3 and st 4 subcorpora

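The sampling for the CLEC college learner subcorpora is described in 《语料库语言学导论》 (Introduction to Corpus Linguistics). When the exam compositions were collected, we went, with the help of the CET-4/CET-6 testing committee, directly to the warehouse of old exam papers and drew the sample there. Papers from every province were available. The method was to take one bundle out of every ten and to filter out compositions scoring below 6; in total more than 2,000 compositions were drawn. A trial sampling was done before the formal sampling. The free-composition part is more concentrated: some texts come from several universities in Zhengzhou, some from Henan Normal University, and the rest from several universities in Shanghai and Guangzhou. Some free-writing material from Tsinghua University was added later. The collection of the free-writing material as a whole cannot be called random sampling, mainly because the manpower and resources were not available. Once conditions allow, sampling can be organized on a large scale, which should make the corpus more representative.
The text formatting does have the problems mentioned. The blank lines are line breaks that Word inserted when the files were converted to plain text. Although a lot of effort has already gone into fixing them, many were still missed. Some of these problems have been avoided in the spoken English corpus developed recently. Dr. Pu Jianzhong (濮建忠) has organized a team that has started making corrections. Thank you very much for the feedback.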
wzli
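
A minimal sketch of how the stray blank lines could be removed automatically, assuming they are simply empty or whitespace-only lines left over from the Word-to-plain-text conversion. The function name `drop_blank_lines` is hypothetical, and it is an assumption that every blank line in the files is spurious; if some blank lines mark real text boundaries, those would be lost too.

```python
def drop_blank_lines(text: str) -> str:
    """Remove empty and whitespace-only lines; all other lines are kept unchanged."""
    kept = [line for line in text.splitlines() if line.strip()]
    # Re-join with a single newline and end the file with one newline.
    return "\n".join(kept) + ("\n" if kept else "")
```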
 
Re: [Original] Some formatting flaws in the CLEC st 3 and st 4 subcorpora

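In addition, some annotation markers are inconsistent across the subcorpora. For example, st3 and st4 have a <title> tag but st2 does not, even though the texts in st2 clearly do have titles: in many of them the first sentence is the title.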
 
Re: [Original] CLEC st 3 and st

With the availability of new corpus annotation tools, many of the
tasks required for tagging the corpus can be automated or semi-automated,
which can help reduce error rates.
 
Re: [Corrections] Some problems with transcription and annotation in CLEC

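Quoting tiger, 2005-7-10 8:18:59:
One more point:
None of the texts has an end tag, so before splitting out the individual texts with the WordSmith Splitter tool, you have to add an end tag of your own choosing yourself.

I did not know what an "end tag" was before. I would like to know how to make one now. Can you give me an example?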
 
Re: [Corrections] Some problems with transcription and annotation in CLEC

For example, a text with an end tag looks like this: <st3>...<st3>
But in st3 of CLEC it is only <st3>...
 
Re: [Corrections] Some problems with transcription and annotation in CLEC

Quoting hancunxin, 2005-9-3 8:50:55:
Quoting tiger, 2005-7-10 8:18:59:
One more point:
None of the texts has an end tag, so before splitting out the individual texts with the WordSmith Splitter tool, you have to add an end tag of your own choosing yourself.

I did not know what an "end tag" was before. I would like to know how to make one now. Can you give me an example?


My way of doing that with CLEC is quite clumsy: first convert the .txt files to .doc, and then use "find and replace" to add <st3> to the end of each text in st3 of CLEC.
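
The same find-and-replace step could probably be scripted without the detour through .doc files. The sketch below is only an illustration under assumptions: it takes for granted that every text begins with the literal string `<st3>`, and both the function name `add_end_tags` and the default end marker `<end>` are made up here; any marker the splitter is told to look for would do.

```python
def add_end_tags(corpus: str, start_tag: str = "<st3>", end_tag: str = "<end>") -> str:
    """Append end_tag to every text, where a text is assumed to begin with start_tag."""
    # Splitting on the start tag yields whatever precedes the first text,
    # followed by the body of each text.
    head, *bodies = corpus.split(start_tag)
    # Trailing whitespace of each body is dropped so the end tag sits
    # directly after the last character of the text.
    closed = [start_tag + body.rstrip() + end_tag + "\n" for body in bodies]
    return head + "".join(closed)
```

For instance, `add_end_tags("<st3>one\n<st3>two\n")` gives `"<st3>one<end>\n<st3>two<end>\n"`. A distinct end marker is used here so that the end of one text cannot be mistaken for the start of the next.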
 
Re: [Corrections] Some problems with transcription and annotation in CLEC

Quoting tiger, 2005-7-19 19:42:21:
There are some texts that are copied from textbooks, as the compilers have stated, especially in st2.
There are too many texts with the same titles, but I have not found any repeated text so far.

I have found some repeated texts in ST2. I wonder if they copied each other's diaries. Have a look!


Sunday Nov 3 Sunny
Today I see [vp6,2-3] the "Guangzhou newspaper". There's one news said" Six years ago there were three women use their blood to pretend [fm2,-] our country's moneys [np5,1-0] Miss Bai Hazi's daughter had [vp6,20-1] become a poor [wd2,2-0]. But there was many people help [vp4,4-1] her. Mr Zi Shui is [vp6,11-3] One f them Although [fm3,-] he was very poor [sn8,15-9] , he sent money to her for a long time. Little Hong Lian has [vp6,3-2] found [wd3,3-7] her [pr3,22-0] for a long time , but he didn't [vp9,18-8] say he was who [pr5,5-2] HongLian found [wd3,-][vp6,20-0] ". And last year he was [vp7,1-1] died from car, then the people knew e [fm1,-] was "Jin Shui".
Sunday Dec 1 Sunny
Today I have seen the "GuangZhou Newspaper". The newspaper is very interesting. There is a [np7,0-1] account of them I like to read. It's said fifty years ago, when we fight for [vp2,1-3] the Japen [fm2,-] enemy. There's [np3,0-1] some soldier [np6,1-0] make up a [np3,4-2] bad man [np3,8-0] to go into [wd3,16-3] the enemy's army[sn8,s-]. They found some imformation [fm1,-] for our army [sn8,6-7] , these helped to push back the enemy. They are [vp6,7-0] very clever and brave, some of them have given their lives to our country. So we must remember them for ever.
Sunday Nov. 24. Sunny
I like play [vp5,1-2]table Tennis. Because my eyes are near-sighted, [fm2,-] play table tennis can make me judge the ball's speed and adirection, [fm1,-] the decid [wd7,1-4] me what to do. Then my eyes would [vp6,25-0] concentrate, and it [pr3,8-3] an be better, and play table tennis can make me clever.
Sunday Nov 3 Sunny
Today I see [vp6,2-3] the "Guangzhou newspaper". There's one news said" Six years ago there were three women use their blood to pretend [fm2,-] our country's moneys [np5,1-0] Miss Bai Hazi's daughter had [vp6,20-1] become a poor [wd2,2-0]. ut [fm1,-] there was many people help [vp4,4-1] her. Mr Zi Shui is [vp6,11-3] One f them Although [fm3,-] he was very poor [sn8,15-9] , he sent money to her for a long time. Little Hong Lian has [vp6,3-2] found [wd3,3-7] her [pr3,22-0] for a long time , but he didn't [vp9,18-8] say he was who [pr5,5-2] HongLian found [wd3,-][vp6,20-0] ". And last year he was [vp7,1-1] died from car, then the people knew e [fm1,-] was "Jin Shui".Sunday Dec 1 Sunny Today I have seen the "GuangZhou Newspaper". The newspaper is very interesting. There is a [np7,0-1] account of them I like to read. It's said fifty years ago, when we fight for [vp2,1-3] the Japen [fm2,-] enemy. There's [np3,0-1] some soldier [np6,1-0] make up a [np3,4-2] bad man [np3,8-0] to go into [wd3,16-3] the enemy's army[sn8,s-]. They found some imformation [fm1,-]for our army [sn8,6-7] , these helped to push back the enemy. They are [vp6,7-0] very clever and brave, some of them have given their lives to our country. So we must remember them for ever. (corpus text entered twice)
 
Anyone who wants to verify this can go to http://www.clal.org.cn/corpus/ChiSearchEngine.aspx
(Note: be sure to search in the "中学生" (middle school students) corpus.)
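
Repeats like the ones above could also be looked for automatically. Since the two copies are not character-for-character identical (one has "But", the other "ut [fm1,-]"), an exact comparison would miss them, so this rough sketch strips the error tags and scores similarity with Python's `difflib`. The tag pattern is inferred only from the samples quoted in this thread, and the function names and the 0.9 threshold are arbitrary choices for illustration. It also assumes the texts have already been split into separate entries; a copy that is fused with the following entry, as in the last sample, would score lower.

```python
import re
from difflib import SequenceMatcher
from itertools import combinations

# Error tags as they appear in the quoted samples, e.g. [vp6,2-3] or [fm1,-].
ERROR_TAG = re.compile(r"\[[a-z]{2}\d+,[^\]]*\]")

def normalize(text: str) -> str:
    """Drop error tags, lowercase, and collapse runs of whitespace."""
    return " ".join(ERROR_TAG.sub(" ", text).lower().split())

def near_duplicates(texts: list[str], threshold: float = 0.9) -> list[tuple[int, int, float]]:
    """Return (i, j, ratio) for each pair of texts at least `threshold` similar."""
    cleaned = [normalize(t) for t in texts]
    pairs = []
    # Every pair is compared, so this is only practical for small subcorpora.
    for i, j in combinations(range(len(cleaned)), 2):
        ratio = SequenceMatcher(None, cleaned[i], cleaned[j], autojunk=False).ratio()
        if ratio >= threshold:
            pairs.append((i, j, ratio))
    return pairs
```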
 
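Building a corpus is a highly significant but enormously labor-intensive undertaking, so some small mistakes are inevitable.

I would like to ask: in your experience of using the corpus, what do you think of the current error-coding scheme? Which codes are used least?
What are the shortcomings of the current codes? If some annotation codes needed to be added or removed, would that be easy to do?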
 
Re: [Corrections] Some problems with transcription and annotation in CLEC

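Quoting 54321, 2006-3-5 12:50:30:
Building a corpus is a highly significant but enormously labor-intensive undertaking, so some small mistakes are inevitable.

I would like to ask: in your experience of using the corpus, what do you think of the current error-coding scheme? Which codes are used least?
What are the shortcomings of the current codes? If some annotation codes needed to be added or removed, would that be easy to do?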

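It is basically adequate. The current codes are sometimes not specific enough. Take VP6 (tense error): wouldn't it be better if the code also indicated which kind of tense error it is? Adding or removing codes should not be hard to implement. Also, the error classification is not detailed enough, and sometimes the same kind of error is given different codes. For example, where a tense-error code is called for, some annotators tag the error as a spelling error or a word-formation error instead.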
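
As for which codes are used least, a frequency count over the tagged files would give a first answer. This is a small sketch, assuming the tags always have the shape seen in the samples quoted earlier in the thread (two lowercase letters, a number, a comma, then position information, all in square brackets); the function name `count_error_codes` is made up for illustration.

```python
import re
from collections import Counter

# Captures the code part (e.g. "vp6") of tags such as [vp6,2-3] or [fm1,-].
ERROR_TAG = re.compile(r"\[([a-z]{2}\d+),[^\]]*\]")

def count_error_codes(text: str) -> Counter:
    """Count how often each error code occurs in the tagged text."""
    return Counter(ERROR_TAG.findall(text))
```

Calling `count_error_codes(text).most_common()` lists the codes from most to least frequent, so the rarely used ones are at the end of the list. A count like this cannot tell whether a code was applied correctly, so the inconsistent tagging mentioned above would still need manual checking.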
 