Some of them are control characters used in file transmission, typically found in newswire texts. They are not visible. One such character, which cost me hours today, is a non-DOS newline character. When the text is opened in Notepad it appears on one line, but when it is opened in a concordancer there are two lines. As that character cannot be copied and pasted into a search/replace dialog, and there is no way to type it on the keyboard, it cannot easily be removed by searching for it and replacing it with nothing. Once I found out what it really was, I saved the file as Unicode, after which it was easy to remove such blank lines. The data was then converted back to GB2312 for processing with ICTCLAS. There are many other characters of different kinds.
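If you would rather hunt for such invisible characters with a script than by hand, something along these lines might help. It is only a rough sketch in Python, not what I actually did: the function names are made up for illustration, and it assumes that DOS line ends (CR+LF) and tabs are the only control characters you want to keep, so a lone CR or LF is reported and removed along with everything else.

```python
import unicodedata
from collections import Counter

# Unicode categories treated as "invisible" here (an assumption of this sketch):
# Cc = control, Cf = format, Zl/Zp = line/paragraph separator.
INVISIBLE = {"Cc", "Cf", "Zl", "Zp"}

def is_invisible(ch):
    """True for control-like characters other than a tab."""
    return ch != "\t" and unicodedata.category(ch) in INVISIBLE

def report_control_chars(text):
    """Count invisible characters, ignoring proper DOS line ends (CR+LF)."""
    counts = Counter()
    for piece in text.split("\r\n"):  # whatever is left inside a piece is stray
        for ch in piece:
            if is_invisible(ch):
                counts[f"U+{ord(ch):04X}"] += 1
    return counts

def strip_control_chars(text):
    """Remove invisible characters but keep DOS line ends and tabs."""
    pieces = text.split("\r\n")
    return "\r\n".join(
        "".join(ch for ch in piece if not is_invisible(ch)) for piece in pieces
    )

# Usage with a file (file names are hypothetical). Read bytes, not text,
# so that Python does not silently translate the line ends:
#   raw = open("news.txt", "rb").read()
#   text = raw.decode("gb2312", errors="replace")
#   print(report_control_chars(text))
#   open("news_clean.txt", "wb").write(
#       strip_control_chars(text).encode("gb2312", errors="replace"))
```

Run the report first and look at the code points it lists before stripping anything. Removing a lone line feed joins the two lines around it, which may or may not be what you want for your data.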
I would suggest that you split the file that causes the problem into halves and test each half with ICTCLAS. Ignore the half that can be processed, split the problematic half again and test, and repeat until you find the root of the problem. Then fix that problem in the whole corpus; normally data from the same source has the same one or two problems.
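The halving can also be automated. The following is a minimal sketch in Python, under two assumptions: that a single line is enough to make processing fail on its own, and that you can wrap your ICTCLAS run in a function that returns True when a chunk is processed successfully. That wrapper (can_process here) is hypothetical; I do not know how you call ICTCLAS, so you would have to write it yourself.

```python
def find_failing_line(lines, can_process):
    """Return the index of a line that makes can_process fail, or None.

    can_process is a caller-supplied (hypothetical) function that takes a
    list of lines and returns True if the segmenter handles them.
    """
    if not lines or can_process(lines):
        return None
    lo, hi = 0, len(lines)  # a failing line lies somewhere in lines[lo:hi]
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if can_process(lines[lo:mid]):
            lo = mid  # first half is fine, so look in the second half
        else:
            hi = mid  # first half fails, so keep splitting it
    return lo

# Stand-in check for demonstration only: pretend the segmenter chokes on
# the control character U+001E.
def fake_can_process(chunk):
    return all("\x1e" not in line for line in chunk)

corpus = ["line one", "line two", "bad\x1eline", "line four", "line five"]
bad_index = find_failing_line(corpus, fake_can_process)
```

Once you have the index, inspect that line character by character to see what is unusual about it. If the file has more than one kind of problem, fix the first and run the search again.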