Re: Why are there so many duplicates in the WECCL corpus?
I used WECCL 1.0 in a project and found the same problem. Here is a paragraph from my manuscript for your reference:
With a script written to verify the integrity of the files in the corpus, we found 124 of the 3,678 files unusable. These include 1 file with no...
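
For anyone who wants to run a similar check on their own copy, here is a minimal Python sketch. It is not the script mentioned in the paragraph above, only an illustration of one way to flag empty files and byte-identical duplicates. The function name `check_corpus`, the `.txt` extension, and the choice of SHA-256 are all assumptions on my part; adjust them to the actual layout of the corpus.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def check_corpus(corpus_dir):
    """Return (empty_files, duplicate_groups) for the .txt files under corpus_dir.

    Hypothetical helper: assumes one plain-text file per essay with a
    .txt extension, which may not match the real corpus layout.
    """
    empty_files = []
    by_digest = defaultdict(list)
    for path in sorted(Path(corpus_dir).rglob("*.txt")):
        data = path.read_bytes()
        if not data.strip():
            # Zero-length or whitespace-only file.
            empty_files.append(path)
            continue
        # Files with byte-identical content share the same digest.
        by_digest[hashlib.sha256(data).hexdigest()].append(path)
    duplicate_groups = [paths for paths in by_digest.values() if len(paths) > 1]
    return empty_files, duplicate_groups


if __name__ == "__main__":
    import sys

    empty, duplicates = check_corpus(sys.argv[1])
    print(f"{len(empty)} empty file(s)")
    for group in duplicates:
        print("identical:", ", ".join(str(p) for p in group))
```

Note that this only catches exact duplicates. Files that differ by a trailing newline, a header line, or encoding would need normalizing before hashing.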