Lancaster Corpus of Academic Written English

xiaoz

Permanent Super Administrator
Staff member
Lancaster Corpus of Academic Written English (LANCAWE) is an ongoing project. To find a description of the corpus and to access existing data (free):

http://www.ling.lancs.ac.uk/groups/slarg/lancawe/
 
Thanks a lot, Dr. Xiao, for providing us with the free NNS written corpus. I find that some of the files, when joined together, don't display properly in the concordancer, as shown in the screen dump below. What is the reason for that, and how do I solve the problem? Secondly, I notice that some files are incomplete, e.g. the article headed <H>045-SS04-FA-F-PG-LIN-B-B-T3</H>. The article ends with "also" in the last line, with no further content. Does it mean that the student had no time to continue writing, or is there another reason? Thanks a lot!
2006032316131847.jpg




 
The display problem is caused by character encoding. Try to download the zipped archive wherever possible. If you save each file in IE, please go to View - Encoding and select Western European (Windows). The default setting in your IE is probably Chinese Simplified.
 
My IE is set to Simplified Chinese. Yet even though I changed it to Western European (Windows), the improper display of some words persists. I don't know why. Sigh!
Anyway, thanks a lot! How about my second question? Why are some of the files in the corpus incomplete, e.g. the one mentioned above? Thanks a lot for your kind reply!

 
Honestly I don't know. Check to see if the incomplete pieces are timed writings.
 
I am sorry to bother you again, Dr. Xiao. One more question: I find that the improper display is gone when I concordance the unzipped files. Why? Is there some mystery here? Are they stored by a different mechanism? Thanks a lot! I hope to see more files available in zipped format soon!

 
When you save the text from a webpage, save it as ANSI text.
 
Thanks a lot, Dr. Xiao! I have joined all the individual files into one big file that is stored in ANSI format. The display problem persists. Do you mean that before I join the individual files, I have to make sure that each of them is downloaded and saved in ANSI format? If so, how exactly? I find that I am never asked whether I would like to save a file from a webpage as ANSI or UNICODE etc.; the computer just asks me what I would like to name the downloaded file. Could you please take the trouble to explain the procedure for saving a file from a webpage as ANSI? Thanks a lot!
 
When you select File - Save As... in IE, you can specify File name, Save as type, and Encoding. If you have a Chinese Simplified operating system, the default encoding is Chinese Simplified (GB2312); you should change it to Western European (Windows).
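To illustrate why the wrong encoding choice garbles the display, here is a minimal Python sketch. The sample string is hypothetical (it is not taken from the corpus), and Python's `cp1252` and `gb2312` codecs are assumed to stand in for IE's "Western European (Windows)" and "Chinese Simplified (GB2312)" settings.

```python
# Hypothetical sample: curly quotes and an accented letter, the kind of
# characters that are single bytes above 0x7F in Western European (Windows).
original = "\u201calso\u201d caf\u00e9"

# The bytes as a Western European (Windows-1252) file would store them.
raw = original.encode("cp1252")

# Reading the bytes with the matching encoding recovers the text.
as_western = raw.decode("cp1252")

# Reading the same bytes as Chinese Simplified misinterprets the high bytes;
# errors="replace" substitutes a placeholder instead of raising an error.
as_chinese = raw.decode("gb2312", errors="replace")

print(as_western)
print(as_chinese)
```

Plain ASCII letters come out the same under both encodings, which may be why only some words in a file are affected.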
 
Re: Lancaster Corpus of Academic Written English

Quoting 清风出袖 (2006-3-24 8:09:37):
Thanks a lot, Dr. Xiao! Wow, it will be a lot of manual labor to save all the files one by one like that!

Xiao is right. Here are a few tips from me:

1. Don't open the files in your internet browser. Simply right-click the link and choose "Save target as" (or "Save link as" in some browsers), and the files will be saved in their original encodings. However, this way you still need to download the files one by one. To save the trouble, see Method 2 below.

2. Use a download tool, such as Thunder (迅雷), Net Transport (影音传送带), Flashget, etc., to download all the files in one go. The 3 pictures below show how to use Thunder to download them (NOTE the settings shown in the pictures):

Right-click a blank area of the webpage and choose "Use Thunder to download all links":

2006032410470234.jpg



Then, make your choices to download txt and zip files only:

2006032410484292.jpg


2006032410493417.jpg




Finally, click OK to start the download. Good luck!
 
IMPORTANT:

Some servers and webmasters may block your IP for mass/batch downloading from their websites. To avoid this, it's recommended to set "max simultaneous jobs" to one in whatever download tool you use. It's always good practice to observe Internet courtesy and netiquette.
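For readers who prefer a script to a download manager, the same idea (txt and zip links only, one job at a time) can be sketched in Python. This is only an illustration: the function names, the two-second pause, and the example URL in the comment are assumptions, not part of any tool mentioned above. Files are written byte for byte, so their original encoding is left untouched.

```python
import time
import urllib.request
from pathlib import Path
from urllib.parse import urlparse


def filter_links(urls, extensions=(".txt", ".zip")):
    """Keep only links whose path ends in one of the wanted extensions."""
    wanted = []
    for url in urls:
        path = urlparse(url).path.lower()
        if path.endswith(tuple(extensions)):
            wanted.append(url)
    return wanted


def download_one_by_one(urls, dest_dir, delay_seconds=2.0):
    """Download the txt/zip links sequentially, pausing between requests."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    saved = []
    for url in filter_links(urls):
        target = dest / Path(urlparse(url).path).name
        # Saves the raw bytes as served; no re-encoding takes place.
        urllib.request.urlretrieve(url, str(target))
        saved.append(target)
        # One job at a time, with a pause, to be polite to the server.
        time.sleep(delay_seconds)
    return saved


# Hypothetical usage (the URL is a placeholder):
# download_one_by_one(["http://example.org/corpus/file001.txt"], "lancawe")
```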
 
Yes, dear laohong! I did it with Flashget and downloaded all the files in one go. After downloading them all, the display problem persists when the concordancer works with those individual files, while the concordancer has NO such problem running on the unzipped files.
 
Quite strange. I tried just now and had no problem here. Anyway, another solution is to use a code converter to convert all the downloaded files from GB to ANSI in one go.
 
There are a lot of them, actually. You can find one among the utilities bundled with Nanji Star Communicator, Twinbridge, etc. Here is a free one:

http://www.mandarintools.com/zhcode.html

Better use the "run off-line version" to do batch conversion. Read the instructions there.
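As a rough alternative to a dedicated converter, batch re-encoding can also be sketched in a few lines of Python. This is a hedged illustration, not how the tool above works: the function name and folder names are hypothetical, and it assumes that "GB" corresponds to Python's `gbk` codec and "ANSI" to `cp1252` (Western European Windows). Characters that have no Western European equivalent are written as `?`.

```python
from pathlib import Path


def convert_files(src_dir, dst_dir, src_encoding="gbk", dst_encoding="cp1252"):
    """Re-encode every .txt file in src_dir and write the result to dst_dir."""
    src_dir, dst_dir = Path(src_dir), Path(dst_dir)
    dst_dir.mkdir(parents=True, exist_ok=True)
    converted = []
    for src_path in sorted(src_dir.glob("*.txt")):
        # Decode with the source encoding; undecodable bytes become U+FFFD.
        text = src_path.read_bytes().decode(src_encoding, errors="replace")
        # Encode with the target encoding; unencodable characters become "?".
        data = text.encode(dst_encoding, errors="replace")
        out_path = dst_dir / src_path.name
        out_path.write_bytes(data)
        converted.append(out_path)
    return converted


# Hypothetical usage: convert_files("downloaded", "converted")
```

Writing to a separate folder keeps the downloaded originals intact, so the conversion can be repeated with different encodings if the first guess turns out to be wrong.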
 