Some errors in COLSEC

laohong

Administrator
Staff member
I recently got a copy of the COLSEC corpus. When I tried to separate the teachers' and students' utterances in the 302 files, I found some errors in the XML header (the first 4 lines of each file). I am posting them here for your reference (I wonder whether the version I have is the final release):

As the corpus itself claims, XML tags are very important for data retrieval; however, the tags in these files are not consistent:

Each of the 302 files should have a <speaker info line. However, only 299 occurrences of "<speaker" can be found (including 1 with a capital letter, Speaker), and only 298 of </speaker>. That means 3 files have no speaker info and 1 has no closing tag.


I found only 270 occurrences of "<interlocutor interlocutor=" and 270 of "> </interlocutor>". That means 32 of the 302 files do not have this line.

Similarly, I found 303 occurrences of "<participant". I believe one of them is not closed properly.

Only 298 </participant> were found: one is not closed properly, and the other 3 are missing.

<Transcription: 300 found altogether, so two are missing. Among the 300, the spelling is not consistent: 264 start with a capital T, and the rest do not.

Similarly, 300 </transcription> were found (2 missing), and one of them is on the first line of the text, although the closing tag is supposed to be on the last line of the file.
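Counts like the ones above could be reproduced with a short script. What follows is only a minimal Python sketch under stated assumptions: the function names (`count_tags`, `problem_tags`, `scan`), the tag list, the `*.txt` pattern, and the UTF-8 default encoding are all illustrative and not part of the corpus or its documentation. It counts opening and closing tags case-insensitively and lists the tags that do not occur exactly once as a pair in a file.

```python
import re
from collections import Counter
from pathlib import Path

# Tag names taken from this post; extend as needed (assumption).
TAGS = ("speaker", "interlocutor", "participant", "transcription")


def count_tags(text, tags=TAGS):
    """Count opening and closing tags in text, ignoring letter case."""
    counts = Counter()
    for tag in tags:
        # Opening tag: "<tag" followed by whitespace or ">".
        opening = re.findall(r"<%s(?=[\s>])" % tag, text, re.IGNORECASE)
        # Closing tag: "</tag>", allowing stray spaces before ">".
        closing = re.findall(r"</%s\s*>" % tag, text, re.IGNORECASE)
        counts["<" + tag] = len(opening)
        counts["</" + tag + ">"] = len(closing)
    return counts


def problem_tags(text, tags=TAGS):
    """Return tags that do not occur exactly once as an open/close pair."""
    counts = count_tags(text, tags)
    return [t for t in tags
            if counts["<" + t] != 1 or counts["</" + t + ">"] != 1]


def scan(directory, encoding="utf-8"):
    """Map each .txt file name in directory to its list of problem tags.

    The encoding is an assumption; undecodable bytes are replaced.
    """
    report = {}
    for path in sorted(Path(directory).glob("*.txt")):
        text = path.read_text(encoding=encoding, errors="replace")
        bad = problem_tags(text)
        if bad:
            report[path.name] = bad
    return report
```

Calling `scan("some_folder")` on a folder of transcripts (the folder name is hypothetical) would return only the files with a missing, duplicated, or unclosed tag. A misspelled tag name is not matched by this sketch, so it shows up as a missing tag.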

Inconsistent spellings of the tags are found here and there, for example:
Transscription (most are Transcription)
disno in 290 cases (but discno in 9)
Speaker (most are speaker)
Interlocutor (most are interlocutor)
...
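Spelling variants of this kind can be surfaced by tallying every tag and attribute name exactly as written. The Python sketch below is only an illustration: the helper names (`tag_names`, `attribute_names`, `case_variants`) are made up for this post, and it assumes the header has already been read into a string.

```python
import re
from collections import Counter


def tag_names(header):
    """Count opening-tag names exactly as spelled in the header."""
    return Counter(re.findall(r"<([A-Za-z]+)[\s>]", header))


def attribute_names(header):
    """Count attribute names, i.e. the word before each '='."""
    return Counter(re.findall(r"([A-Za-z]\w*)=", header))


def case_variants(counter):
    """Group names that differ only in letter case."""
    groups = {}
    for name, n in counter.items():
        groups.setdefault(name.lower(), {})[name] = n
    # Keep only the names that occur in more than one spelling.
    return {key: forms for key, forms in groups.items() if len(forms) > 1}
```

Summing these counters over all files gives a frequency list in which rare spellings such as Transscription or discno stand out at the bottom, while `case_variants` picks up pairs like Speaker and speaker.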

<speaker speaker1=male ...
Speaker gender is given in the tag as above. However, the attribute names vary: some are in the sp2=male format, and 14 cases use speaker2=... instead of sp2=...

Some other problems:
<interlocutor interlocutor=?> </interlocutor> (14 cases)
<interlocutor gender=?> <interlocutor>
The case immediately above does not follow the convention.

Funny characters:
<Transcription id=0102 disno=01021122£-02£-0507>
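Stray characters like these can be located by flagging anything outside the ASCII range in the header lines. This is a small Python sketch, assuming the header is meant to be plain ASCII (my assumption, not a documented rule of the corpus); the function name `non_ascii` is illustrative.

```python
def non_ascii(line):
    """Return (index, character, code point) for each non-ASCII character."""
    return [(i, ch, "U+%04X" % ord(ch))
            for i, ch in enumerate(line)
            if ord(ch) > 127]
```

Running it on each of the first four lines of a file and printing any non-empty result would point directly at the position of every suspicious character.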

And finally, numerous Chinese punctuation marks are used in the texts...
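One rough way to tally such marks is to search for characters in two Unicode blocks: CJK Symbols and Punctuation (U+3000 to U+303F) and Halfwidth and Fullwidth Forms (U+FF00 to U+FFEF). The Python sketch below is a heuristic under that assumption; the second block also contains fullwidth letters and digits, and the names `CJK_FORMS` and `chinese_punctuation` are illustrative.

```python
import re
from collections import Counter

# CJK Symbols and Punctuation, plus Halfwidth and Fullwidth Forms.
# The second block also covers fullwidth letters and digits.
CJK_FORMS = re.compile(r"[\u3000-\u303F\uFF00-\uFFEF]")


def chinese_punctuation(text):
    """Count each CJK or fullwidth character found in text."""
    return Counter(CJK_FORMS.findall(text))
```

Applied to the whole corpus, the resulting counter shows which marks occur and how often, which helps decide whether they can safely be replaced by their ASCII counterparts.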
 
Re: Some errors in COLSEC

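Thanks for pointing these out. I had seen the corpus before, but since I never actually used it, I did not expect the problems to be even worse than xiao mentioned. If the four lines of SGML header alone contain this many errors, I would rather not get into the ones in the body text. Even without the sound files to verify against, simply generating a wordlist quickly turns up a large amount of garbled text..... Also, 020198.txt contains two Transcripts; has nobody noticed that either?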

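Building a corpus is very hard work, and errors are always unavoidable. Still, some of these errors could have been caught with a simple search, and this version seems rather beyond the pale...... What is remarkable is that I have seen people online who published several papers using data like this.......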

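In short, we have been discussing this for a year now; has anyone managed to get hold of a "clean copy"? I had planned to spend some time fixing it up, but after a whole afternoon I still was not done, so I decided to give up.......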
 
Re: Some errors in COLSEC

I think the people concerned should read this thread and take it as a lesson.
 