There are many existing Chinese corpora (see the relevant section in this forum) which can be used directly. If you build your own corpus, you will need to segment the Chinese text into words using a tool such as ICTCLAS. Some concordancers also require the Chinese data to be converted into Unicode.
If you are interested in collocation studies, you will find unit A10.1 particularly useful. But I think you will benefit from a quick read through the whole of unit 10 to see how corpora can help in language studies, so as to establish a link between your research interests and what can actually be done with a corpus.
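For the Unicode conversion step, a minimal Python sketch might look like the one below. It assumes the source files are in GB18030 (a superset of GB2312 and GBK); that encoding, the function name `convert_to_utf8`, and the file names are illustrative assumptions, so check what your own files actually use. Word segmentation is not shown here, since it depends on the tool you choose.

```python
import os
import tempfile
from pathlib import Path


def convert_to_utf8(src, dst, source_encoding="gb18030"):
    """Re-encode a text file from a legacy Chinese encoding to UTF-8.

    source_encoding is an assumption: GB18030 also decodes GB2312/GBK
    files, but Big5 files (for example) need encoding="big5" instead.
    """
    # Decoding raises UnicodeDecodeError if the assumed encoding is wrong,
    # which is safer than silently writing garbled text.
    text = Path(src).read_text(encoding=source_encoding)
    Path(dst).write_text(text, encoding="utf-8")
    return text


if __name__ == "__main__":
    # Hypothetical demo files created in a temporary directory.
    folder = tempfile.mkdtemp()
    src = os.path.join(folder, "sample_gb.txt")
    dst = os.path.join(folder, "sample_utf8.txt")
    Path(src).write_bytes("语料库语言学".encode("gb18030"))
    print(convert_to_utf8(src, dst))
```

If the decoding step raises an error, the file is probably not in the assumed encoding, and it is worth checking before converting the whole corpus.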
Thank you for your explanation and all! And now I know the next thing to do is to try to retain these terms ... There are some computer guys in our team and hopefully they will figure out how to do it. Well, maybe I should start learning regular expressions now. Really a big challenge for us... but big fun also!
For this specific part I've read Mr. Yang's textbook and the relevant chapter in Susan Hunston's work. Are there any other works you could recommend to us?
Thank you for your patience!
I think technical terms like CO2, CH3CHO and NaCl should be retained in your corpus, as abbreviations like these are a defining feature that distinguishes this type of text from many other genres. A one-million-word corpus can of course be used for collocation studies. I think you should read more about what other people have done in this kind of research to see the range of research questions that can be addressed.
Thank you very much, Dr. Xiao! But what can we do with empirical formulas of chemical substances like CO2, CH3CHO and NaCl? They keep appearing in my corpus, and WS5 seems to have failed to recognize them...
And my little raw corpus is just 1.03 million words. Is it large enough for word frequency counting? I've tried to do some collocation research, but really don't know where to start. Do you have any suggestions? Thx!
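One way to retain such formulas is to tokenize the raw text yourself with a regular expression that treats a formula as a single token. The sketch below is only an illustration under assumptions of my own: the pattern, the function names `tokenize` and `collocates`, and the window size are hypothetical and not taken from any particular concordancer. The pattern is a shape-based heuristic, so it will also catch all-caps acronyms such as USA, and it does not handle brackets or charges.

```python
import re
from collections import Counter

# Heuristic: a formula is either two or more element-like units
# (capital letter, optional lowercase letter, optional digits),
# or a single unit followed by digits (e.g. H2).
TOKEN = re.compile(
    r"(?P<formula>\b(?:(?:[A-Z][a-z]?\d*){2,}|[A-Z][a-z]?\d+)\b)"
    r"|(?P<word>[A-Za-z]+(?:-[A-Za-z]+)*)"
    r"|(?P<number>\d+(?:\.\d+)?)"
)


def tokenize(text):
    """Split text into tokens, keeping chemical formulas intact."""
    tokens = []
    for match in TOKEN.finditer(text):
        token = match.group()
        # Formulas keep their case (Co and CO differ); words are lowercased.
        tokens.append(token if match.lastgroup == "formula" else token.lower())
    return tokens


def collocates(tokens, node, window=4):
    """Count raw co-occurrences within `window` tokens of each `node` hit."""
    counts = Counter()
    for i, token in enumerate(tokens):
        if token == node:
            start, end = max(0, i - window), min(len(tokens), i + window + 1)
            counts.update(tokens[j] for j in range(start, end) if j != i)
    return counts


if __name__ == "__main__":
    sample = "The reaction of CH3CHO with NaCl releases CO2 and H2."
    tokens = tokenize(sample)
    print(Counter(tokens).most_common(5))   # simple word frequency list
    print(collocates(tokens, "CO2", window=2))
```

The counts from `collocates` are raw frequencies only; a real collocation study would normally go on to an association measure such as MI or log-likelihood.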
If, by the wording "nonverbal expressions", you mean tables and graphics (rather than gestures and facial expressions as in multimodal corpora) in your EST corpus, then you might find the following discussion of use: http://www.corpus4u.org/showthread.php?t=5026
A raw corpus, if it is large, can be useful in lexical studies: collocations, semantic prosodies, lexical bundles, etc. It can also be used for other kinds of research, such as grammatical studies if you know how to extract patterns with the help of regular expressions, or discourse studies with the help of keyword analysis.
I will be doing my postgraduate studies in China, and I'm now preparing my resume as a prospective exam-exempt student. I've tried to build an EST corpus and learned how to make concordances and analyse the concordance results (wordlist making, some collocation study). The exploring has been fascinating. I'd like to know what more I can do with a raw corpus. And how do we deal with the nonverbal expressions (which appear more frequently in EST) in a corpus? Looking forward to your reply!
I'm a young soul in the world of corpus linguistics, and I really benefited a lot from this lovely forum when I was doing my USRP. It also reveals more of the colorful world to me. Thank you for your work in maintaining this forum and keeping it thriving. And, Dr. Xiao, do you have any suggestions for a BA determined to devote her graduate years to corpus linguistics? Looking forward to your reply.
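To give a rough idea of what extracting a grammatical pattern from untagged text can look like, here is a small Python sketch that searches for passive-like sequences (a form of "be" followed by a word ending in -ed). The pattern and the function name `find_passive_like` are my own illustrative assumptions. Without POS tagging it misses irregular participles (e.g. "was taken") and over-matches adjectives such as "is red", so the hits still need manual checking.

```python
import re

# A form of "be", whitespace, then any word ending in -ed.
PASSIVE_LIKE = re.compile(
    r"\b(?:am|is|are|was|were|be|been|being)\s+\w+ed\b",
    re.IGNORECASE,
)


def find_passive_like(text):
    """Return every 'be + -ed word' sequence found in raw text."""
    return [match.group() for match in PASSIVE_LIKE.finditer(text)]


if __name__ == "__main__":
    sample = "The sample was heated and the results were recorded."
    for hit in find_passive_like(sample):
        print(hit)
```

The same approach works for other surface patterns, as long as they can be described in terms of word forms rather than grammatical categories.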