Why are there so many duplicate texts in WECCL?

beginner · 2010-01-23

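I have recently been running statistics on the narrative essays in WECCL and found that many of the essays are duplicated, even among those without error annotation. Does anyone know why?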

xflu76 · 2010-01-24

Re: Why are there so many duplicate texts in WECCL?

I used WECCL 1.0 in a project and found the same problem. Here is a paragraph from my manuscript for your reference:

With a script written to verify the integrity of the files in the corpus, we found 124 of the 3,678 files unusable. These include 1 file with no header, 1 with two non-identical headers, 4 with only one sentence, 17 empty files, and 101 that duplicate other files. This leaves us 3,554 files to work with. The corpus has a total of 1,119,510 words, and the length of the individual essays ranges from 89 to 892 words (mean = 315, standard deviation = 87).
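A check along these lines can be sketched in Python. This is not the script used for the manuscript, only a minimal illustration of how empty and duplicate files might be found. The function name `audit_corpus`, the `*.txt` pattern, and the UTF-8 encoding are assumptions, and the comparison covers the whole file including any header, so duplicates whose headers differ would need the header stripped first.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def audit_corpus(corpus_dir, pattern="*.txt"):
    """Return (empty_files, duplicate_groups) for text files under corpus_dir.

    Hypothetical helper: the file pattern and encoding are assumptions,
    not properties of the WECCL distribution.
    """
    empty_files = []
    by_digest = defaultdict(list)

    for path in sorted(Path(corpus_dir).rglob(pattern)):
        raw = path.read_text(encoding="utf-8", errors="replace")
        # Collapse all whitespace so that differences in line endings
        # or trailing blanks do not hide otherwise identical essays.
        text = " ".join(raw.split())
        if not text:
            empty_files.append(path)
            continue
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        by_digest[digest].append(path)

    # Any digest shared by two or more files marks a group of duplicates.
    duplicate_groups = [paths for paths in by_digest.values() if len(paths) > 1]
    return empty_files, duplicate_groups
```

Calling `audit_corpus("path/to/corpus")` returns the whitespace-only files and the groups of identical files; keeping the first file of each group and setting the rest aside is one way to get a deduplicated set.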

beginner · 2010-01-24

Re: Why are there so many duplicate texts in WECCL?

xflu76 wrote:
I used WECCL 1.0 in a project and found the same problem. Here is a paragraph from my manuscript for your reference:

With a script written to verify the integrity of the files in the corpus, we found 124 of the 3,678 files unusable. These include 1 file with no header, 1 with two non-identical headers, 4 with only one sentence, 17 empty files, and 101 that duplicate other files. This leaves us 3,554 files to work with. The corpus has a total of 1,119,510 words, and the length of the individual essays ranges from 89 to 892 words (mean = 315, standard deviation = 87).

Thanks a lot for your prompt response. My present study covers only the narrative essays, so I will have to delete the duplicate files first. I just wonder why WECCL contains duplicates like this, which is quite inconvenient.

xflu76 · 2010-01-25

Re: Why are there so many duplicate texts in WECCL?

Did you try WECCL 2.0? I haven't tried it myself yet, but these problems might have been solved there, as the compilers of WECCL 1.0 were aware of them.

beginner wrote:
Thanks a lot for your prompt response. My present study covers only the narrative essays, so I will have to delete the duplicate files first. I just wonder why WECCL contains duplicates like this, which is quite inconvenient.

beginner · 2010-01-25

Re: Why are there so many duplicate texts in WECCL?

xflu76 wrote:
Did you try WECCL 2.0? I haven't tried it myself yet, but these problems might have been solved there, as the compilers of WECCL 1.0 were aware of them.

Thanks for your response. I have looked at 2.0, but the narrative files have been removed. I don't know why.

xujiajin · 2010-01-25

Re: Why are there so many duplicate texts in WECCL?

SWECCL2.0 is made up of a completely different set of data from SWECCL1.0.

SWECCL1.0 does, however, have a revised edition on a single DVD; the old SWECCL1.0 came on three CDs.

brindyfu · 2010-02-20

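Re: Why are there so many duplicate texts in WECCL?

The WECCL 1.0 book says on pages 34-35 that there are 3,880 texts. I checked the electronic corpus and the count matches, so why does everyone say there are only 3,678 essays?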
