I recently got a copy of the COLSEC corpus. When I tried to split the teacher and student utterances in the 302 files, I found some errors in the XML header (the first 4 lines of each file). I am posting them here for your reference (I wonder whether the version I have is the final release):
As the corpus itself claims, XML tags are very important for data retrieval; however, the tags are not consistent:
Each of the 302 files should have a <speaker info line; however, there are only 299 occurrences of "<speaker" (including 1 with a capital letter, Speaker) and 298 of </speaker>. That means 3 files have no speaker info and 1 has no closing tag.
I only found 270 occurrences of "<interlocutor interlocutor=" and 270 cases of "> </interlocutor>". That means 32 of the 302 files do not have this line.
Similarly, I found 303 occurrences of "<participant". I believe one of them is a tag that was not closed properly.
Only 298 occurrences of </participant> were found: one is not closed properly and the other 3 are missing.
<Transcription: 300 were found altogether, so two are missing. Among the 300 found, the spelling is not consistent: 264 have a capital letter T and the rest do not.
Similarly, 300 occurrences of </transcription> were found (2 missing), but one of them is on the first line of the text, although the closing tag is supposed to be on the last line of the file.
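For anyone who wants to repeat these counts on their own copy, here is a minimal sketch in Python. It is not the script I used for the numbers above; the function names and the input format (a dict mapping file name to file text) are my own assumptions, and you would need to read the files with whatever encoding your copy uses.

```python
import re

def tag_report(files, tag):
    """Count opening and closing occurrences of `tag` per file, ignoring case.

    `files` is a hypothetical input: a dict mapping file name to file text.
    Returns a dict mapping file name to (n_opening, n_closing).
    """
    open_re = re.compile(r"<\s*" + re.escape(tag) + r"\b", re.IGNORECASE)
    close_re = re.compile(r"<\s*/\s*" + re.escape(tag) + r"\s*>", re.IGNORECASE)
    report = {}
    for name, text in files.items():
        report[name] = (len(open_re.findall(text)), len(close_re.findall(text)))
    return report

def files_with_problems(report):
    """Names of files that do not have exactly one opening and one closing tag."""
    return sorted(name for name, counts in report.items() if counts != (1, 1))

if __name__ == "__main__":
    # Tiny made-up sample, only to show the shape of the input.
    sample = {
        "a.txt": "<speaker speaker1=male> </speaker>\nhello",
        "b.txt": "<Speaker sp2=female>\ntext",
        "c.txt": "no header here",
    }
    print(files_with_problems(tag_report(sample, "speaker")))
```

Matching is case-insensitive on purpose, so that Speaker and speaker are counted together; the capitalisation differences can then be tallied separately.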
Inconsistent tag spellings are found here and there, for example:
Transscription (most are Transcription),
290 disno (but 9 discno)
Speaker (most are speaker)
Interlocutor (most are interlocutor)
...
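Spelling variants like these are easy to surface by tallying every tag name and attribute name exactly as written. Below is a rough sketch; the regular expressions and the function name are my own assumptions about what the header lines look like (unquoted attribute values, as in the examples in this post), not anything defined by the corpus.

```python
import re
from collections import Counter

# Tag name right after "<" or "</"; attribute name right before "=".
TAG_NAME = re.compile(r"</?\s*([A-Za-z][A-Za-z0-9]*)")
ATTR_NAME = re.compile(r"([A-Za-z][A-Za-z0-9]*)\s*=")

def tally_names(header_lines):
    """Return (tag_counts, attribute_counts), keeping the original spelling and case."""
    tags, attrs = Counter(), Counter()
    for line in header_lines:
        tags.update(TAG_NAME.findall(line))
        attrs.update(ATTR_NAME.findall(line))
    return tags, attrs

if __name__ == "__main__":
    # Made-up header lines modelled on the variants listed above.
    lines = [
        "<Transcription id=0102 disno=0102>",
        "<Transscription id=0103 discno=0103>",
        "<speaker speaker1=male> </speaker>",
    ]
    tags, attrs = tally_names(lines)
    for name, n in sorted(tags.items()):
        print(name, n)
    for name, n in sorted(attrs.items()):
        print(name, n)
```

Any name that appears only a handful of times across 302 files (Transscription, discno, and so on) is a likely typo.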
<speaker speaker1=male ...
Speaker gender is given in the tag as above; however, the attribute name is not consistent: some use the sp2=male... format, and 14 cases were found using speaker2=... instead of sp2=...
Some other problems:
<interlocutor interlocutor=?> </interlocutor> 14 cases
<interlocutor gender=?> <interlocutor>
The above case does not follow the convention.
Funny characters:
<Transcription id=0102 disno=01021122£-02£-0507>
And finally, numerous Chinese punctuation marks are used in the texts...
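A quick way to list such characters is to count everything that falls in the Unicode blocks "CJK Symbols and Punctuation" (U+3000 to U+303F) and "Halfwidth and Fullwidth Forms" (U+FF00 to U+FFEF). This is only a sketch: it assumes the text has already been decoded correctly to Unicode, the function name is hypothetical, and the second block also contains fullwidth letters and digits, not just punctuation.

```python
from collections import Counter

# Unicode blocks where Chinese-style punctuation usually lives.
CJK_RANGES = (
    (0x3000, 0x303F),  # CJK Symbols and Punctuation
    (0xFF00, 0xFFEF),  # Halfwidth and Fullwidth Forms
)

def cjk_style_chars(text):
    """Count characters of `text` that fall inside the ranges above."""
    return Counter(
        ch for ch in text
        if any(lo <= ord(ch) <= hi for lo, hi in CJK_RANGES)
    )

if __name__ == "__main__":
    # Made-up utterance with a fullwidth comma and an ideographic full stop.
    found = cjk_style_chars("Yes\uff0cI think so\u3002 (ok)")
    for ch, n in found.items():
        print(f"U+{ord(ch):04X}", n)
```

Running this over each file gives a per-file inventory that can then be mapped to the ASCII equivalents, if that is what you want.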