The Lancaster Los Angeles Spoken Chinese Corpus (http://www.lancaster.ac.uk/fass/projects/corpus/LLSCC/) has about 1 million words. If that's still too small, you could try building your own from existing data, such as TV shows or movie transcripts. If you're interested in...