Character encoding in corpus construction

xiaoz · 2005-07-05

This chapter first briefly reviews the history of character encoding. Following from this is a discussion of standard and non-standard native encoding systems, and an evaluation of the efforts to unify these character codes. Then we move on to discuss Unicode as well as various Unicode Transformation Formats (UTFs). As a conclusion, we recommend that Unicode (UTF-8, to be precise) be used in corpus construction.

http://www.lancs.ac.uk/postgrad/xiaoz/papers/character encoding.doc

动态语法 · 2005-07-06

Re: Character encoding in corpus construction

WordSmith 4 seems to handle Chinese only when the text files are in Unicode; WS4 cannot process Chinese text in UTF-8 format. Is that right?

xiaoz · 2005-07-06

Exactly. WS4 is Unicode-based (UTF-16). If your Chinese data is encoded in UTF-8, WS4 will prompt you to convert it into Unicode. If the data is in GB2312 or another native encoding, WS4 does not recognize it and cannot handle it properly. UTF-8 text cannot be processed reliably by WordSmith 4 directly; it has to be converted first.

But it is advisable to back up your data before conversion, as WS4 overwrites the original files.
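
One way to make that backup is sketched below in Python. It is only an illustration and has nothing to do with WS4 itself: the function name `backup_corpus` and the folder layout are assumptions made for this example. It copies the whole corpus folder into a time-stamped folder before any conversion is run.

```python
import shutil
from datetime import datetime
from pathlib import Path

def backup_corpus(corpus_dir, backup_root):
    """Copy corpus_dir into a new time-stamped folder under backup_root.

    Hypothetical helper for illustration; returns the path of the copy.
    """
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    target = Path(backup_root) / f"{Path(corpus_dir).name}-{stamp}"
    # copytree creates the target folder (and missing parents) and
    # fails if the target already exists, so nothing is overwritten.
    shutil.copytree(corpus_dir, target)
    return target
```

For example, `backup_corpus("my_corpus", "backups")` would leave an untouched copy under `backups/` that can be restored if the conversion goes wrong.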

[This post was edited by the author on 2005-07-06 at 01:00:53]

xujiajin · 2005-07-06

Re: Character encoding in corpus construction

Quoting xiaoz's post of 2005-7-6 0:50:08:
Exactly. WS4 is Unicode-based (UTF-16). If your Chinese data is encoded in UTF-8, WS4 will prompt you to convert it into Unicode. If the data is in GB2312 or another native encoding, WS4 does not recognize it and cannot handle it properly. UTF-8 text cannot be processed reliably by WordSmith 4 directly; it has to be converted first.

But it is advisable to back up your data before conversion, as WS4 overwrites the original files.

[This post was edited by the author on 2005-07-06 at 01:00:53]

Important warning! Thanks.

动态语法 · 2005-07-06

Re: Character encoding in corpus construction

Quoting xiaoz's post of 2005-7-6 0:50:08:
Exactly. WS4 is Unicode-based (UTF-16). If your Chinese data is encoded in UTF-8, WS4 will prompt you to convert it into Unicode.

When does it warn you? When you use the conversion program? Otherwise it doesn't display any warning when you use its tools (concord, wordlist, etc.)

Quoting xiaoz's post of 2005-7-6 0:50:08:

But it is advisable to back up your data before conversion, as WS4 overwrites the original files.

I prefer using the MS Notepad tool to do the conversion.

xiaoz · 2005-07-06

1) After you have chosen texts, click on the button for "test Unicode" (the third from right on the toolbar, shaped like 人). If your corpus is in UTF-8, you will get a warning as in the screen dump.

(The status for encoding in the table when you load your corpus: U for Unicode, 8 for UTF-8, a for any other encoding)
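
How WS4 decides between these three statuses is not documented in this thread. The Python sketch below shows one plausible way to reproduce a similar U / 8 / a label for your own files before loading them. The function name `encoding_status` is made up for this example, and the rules (a UTF-16 byte order mark means "U", anything that decodes as UTF-8 means "8") are assumptions, not WS4's actual test.

```python
import codecs

def encoding_status(raw: bytes) -> str:
    """Label raw file bytes as 'U' (UTF-16), '8' (UTF-8) or 'a' (anything else).

    Assumed rules for illustration only, not WordSmith's own logic.
    """
    # A UTF-16 file normally starts with a byte order mark (BOM).
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "U"
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        # Typical for GB2312/GBK/Big-5 data, although a short native-encoded
        # text can occasionally happen to be valid UTF-8 as well.
        return "a"
    return "8"
```

Note that a file containing only plain English letters is also valid UTF-8, so this sketch labels it "8"; UTF-16 files saved without a BOM are not detected.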

2) You can of course do conversions using Notepad etc., but only one file can be converted at a time, so this approach is not recommended for large corpora. For Chinese data, Scott Piao's MLCT (available at this site) is recommended: first convert GBK or Big-5 to UTF-8, then load the UTF-8 data into WS4 and allow WS4 to do the final conversion. Converting directly to UTF-16 with MLCT will not do because, I think, MLCT and WS4 use different byte orders for UTF-16 (big-endian vs. little-endian).
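
For those who prefer a script to a dedicated tool, the batch step (GBK or Big-5 to UTF-8) can be sketched in Python as below. This is not how MLCT works internally; `convert_tree` and its parameters are names invented for this example, and the sketch assumes every `.txt` file under the source folder really is in the encoding you name.

```python
from pathlib import Path

def convert_tree(src_dir, dst_dir, src_encoding="gbk", dst_encoding="utf-8"):
    """Re-encode every .txt file under src_dir into a parallel tree in dst_dir.

    Hypothetical helper: the originals are only read, never modified.
    """
    src_root, dst_root = Path(src_dir), Path(dst_dir)
    converted = []
    for src in sorted(src_root.rglob("*.txt")):
        dst = dst_root / src.relative_to(src_root)
        dst.parent.mkdir(parents=True, exist_ok=True)
        # Work on bytes so that line endings are preserved exactly.
        # A wrong src_encoding usually raises UnicodeDecodeError here.
        text = src.read_bytes().decode(src_encoding)
        dst.write_bytes(text.encode(dst_encoding))
        converted.append(dst)
    return converted
```

On the byte-order point: in Python, the codec names "utf-16-le" and "utf-16-be" fix the byte order explicitly and write no byte order mark, while "utf-16" writes a byte order mark followed by data in the machine's native order. Going through UTF-8, as recommended above, avoids the question altogether because UTF-8 has no byte-order variants.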

动态语法 · 2005-07-07

Thanks for the useful tips.

PS: These standards never fail to amaze me: just when you think Unicode is good enough, you get UTF-8, 16, 32 ...

[This post was edited by the author on 2005-07-07 at 11:33:01]

xiaoz · 2005-07-07

For Chinese corpora with a lot of markup, UTF-8 can save disk space, as all English letters (used for markup and annotation, for example) are still one byte each, although Chinese characters take three bytes. In UTF-16, English letters and common Chinese characters alike are two bytes each. UTF-32 is currently used rarely; it stores every character in a fixed four bytes.
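
The sizes are easy to check for yourself. The short Python sketch below (the function name `encoded_sizes` is just for this example) counts the bytes a string occupies in each UTF, without a byte order mark; note that a common Chinese character such as 汉 encodes to three bytes in UTF-8.

```python
def encoded_sizes(text):
    """Return the number of bytes text occupies in each UTF (no BOM)."""
    return {name: len(text.encode(name))
            for name in ("utf-8", "utf-16-le", "utf-32-le")}

print(encoded_sizes("A"))   # {'utf-8': 1, 'utf-16-le': 2, 'utf-32-le': 4}
print(encoded_sizes("汉"))  # {'utf-8': 3, 'utf-16-le': 2, 'utf-32-le': 4}

# A small annotated sample: 15 markup characters around 2 Chinese characters.
print(encoded_sizes('<w pos="n">汉语</w>'))
```

Whether UTF-8 or UTF-16 is smaller for a given corpus therefore depends on the ratio of markup to Chinese text, so it is worth running this on a sample of your own data.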

xujiajin · 2005-07-08

Re: Character encoding in corpus construction

Quoting xiaoz's post of 2005-7-7 21:06:49:
For Chinese corpora with a lot of markup, UTF-8 can save disk space, as all English letters (used for markup and annotation, for example) are still one byte each, although Chinese characters take three bytes. In UTF-16, English letters and common Chinese characters alike are two bytes each. UTF-32 is currently used rarely; it stores every character in a fixed four bytes.

Understood.
Yes. Any complexity is motivated by either efficiency or stupidity.

patricx · 2005-07-24

I have also noticed this problem: most Chinese texts are in ASCII code, while WordSmith 4 only recognizes Chinese texts in Unicode, and some other concordance programs recognize different codes. My question is:
how can we switch from one code to another freely, especially for large texts?

xujiajin · 2005-07-24

Try MLCT.
http://www.corpus4u.org/showthread.php?t=90
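
If a script is preferred for very large files, the conversion can also be done in chunks so that the whole text never has to fit in memory. The Python sketch below is only an illustration and is unrelated to MLCT; `convert_stream` and its parameters are assumed names, and you still need to know the source encoding in advance.

```python
def convert_stream(src_path, dst_path, src_encoding,
                   dst_encoding="utf-8", chunk_chars=1 << 20):
    """Re-encode a text file chunk by chunk (hypothetical helper).

    Files are opened in text mode, so a multi-byte character is never
    split across chunks; newline="" leaves line endings unchanged.
    """
    with open(src_path, "r", encoding=src_encoding, newline="") as src, \
         open(dst_path, "w", encoding=dst_encoding, newline="") as dst:
        while True:
            chunk = src.read(chunk_chars)  # up to chunk_chars characters
            if not chunk:
                break
            dst.write(chunk)
```

For example, `convert_stream("big5_text.txt", "utf8_text.txt", "big5")` writes a UTF-8 copy and leaves the original file as it was.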
