A naive question: what are Unicode and UTF-8?

xusun575

Senior Member
Quite a few people here have mentioned "Unicode" and UTF-8 in their postings. I have a very naive question: what are they? Does Unicode mean UTF-8, or vice versa?
 
http://www.corpus4u.com/forum_view.asp?view_id=416&forum_id=34
 
The 911 report downloaded from our site is stored as Unicode big-endian, but MONOCONCPRO can't read it. When I converted it from Unicode to ANSI, which MONOCONCPRO does recognize, the file shrank to half its size. I am not sure whether any content was lost in the process. Is there a tool that can convert between these two encodings without losing content?
 
Check the start and the end of the document, plus a few randomly selected lines in between, to see whether the content has been truncated or corrupted in any way.
 
OK, I will try it out! I remember that there is a concordancer that can convert one encoding to another. That is probably the safest way to convert, right? Thanks a lot, Dr. xujiajin!
 
Changes in file size are only natural in encoding conversion. In ASCII and UTF-8, each English letter, digit, etc. takes one byte (a Chinese character takes two bytes in a legacy encoding such as GB2312 and usually three bytes in UTF-8). In Unicode (UTF-16, big- or little-endian), most characters take two bytes, and rarer ones take four. In UTF-32, every character takes four bytes. Conversion between these encodings will therefore change file sizes.
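The byte counts described above can be checked directly. This is a minimal sketch in Python; the sample characters and the helper name byte_lengths are my own choices for illustration, and the "-be" codec names are used so that no byte-order mark is added to the counts.

```python
# Compare how many bytes one character occupies in different encodings.
# The "-be" variants are used so Python does not prepend a byte-order mark.
ENCODINGS = ["utf-8", "utf-16-be", "utf-32-be"]

def byte_lengths(ch):
    """Return {encoding name: number of bytes} for a single character."""
    return {enc: len(ch.encode(enc)) for enc in ENCODINGS}

# An English letter, an accented letter, a Chinese character, and an emoji
# (the emoji lies outside the range that UTF-16 can store in two bytes).
for ch in ["A", "é", "中", "\U0001F600"]:
    print(repr(ch), byte_lengths(ch))
```

A file of plain English text therefore halves in size when converted from UTF-16 to ASCII or UTF-8, without losing any characters.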
 
So you mean that the shrinkage in size is inevitable and doesn't mean that content was lost? Following Dr. xujiajin's advice, I checked and found nothing irregular after the conversion. Thanks a lot, Dr. xiao zhonghua and Dr. xu jiajin!
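For anyone who wants more assurance than spot-checking a few lines, a conversion can be verified by decoding the result and comparing it with the original text. The sketch below is only an illustration: the function name convert_checked and the sample string are made up, the data is held in memory rather than read from the actual report, and "ascii" stands in for the ANSI target (on Windows, "ANSI" usually means a code page such as cp1252).

```python
def convert_checked(data: bytes, src: str, dst: str) -> bytes:
    """Re-encode bytes from src to dst, verifying that no text was lost."""
    text = data.decode(src)
    # encode() raises UnicodeEncodeError if dst cannot represent a character,
    # so unconvertible characters are reported rather than silently dropped.
    converted = text.encode(dst)
    if converted.decode(dst) != text:
        raise ValueError("round trip changed the text")
    return converted

# Hypothetical stand-in for the file content, stored as UTF-16 big-endian.
original = "The 9/11 Commission Report".encode("utf-16-be")
converted = convert_checked(original, "utf-16-be", "ascii")

# The size is halved, yet the text is identical after decoding.
print(len(original), len(converted))
```

If the source contained a character that the target encoding cannot store, the call would raise an error instead of producing a shorter file with missing content.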
 
Is it possible that some part of the document is stored in Unicode while the other part is stored in ASCII? Just now, when I selected, pasted, and saved a file from the Oxford Text Archive, I was warned that some of the content was in Unicode.
 
Accented characters in some European languages, some special symbols in English, and characters from non-alphabetic writing systems must be saved in a Unicode (UTF-16 or UTF-8) file if you want to preserve those characters properly.
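A warning like the one described above usually means the text contains at least one character outside ASCII. A small sketch for finding such characters follows; the helper name non_ascii_chars and the sample sentence are hypothetical.

```python
def non_ascii_chars(text):
    """Return the characters in text that plain ASCII cannot store."""
    # ASCII covers code points 0 through 127 only.
    return [ch for ch in text if ord(ch) > 127]

sample = "a naïve résumé"
print(non_ascii_chars(sample))
```

An empty result means the text can be saved as ASCII safely; anything listed needs a Unicode encoding to survive.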

If most of your text is ordinary English letters with only a few special characters, saving the file in UTF-8 will save disk space. All ordinary characters then take one byte each, and only the special characters take two or more bytes each.

It is impossible to save part of a file as ASCII and part as Unicode; the whole file is written in a single encoding.
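One related point may explain why a file can look "partly ASCII": text made up only of ASCII characters produces exactly the same bytes in ASCII and in UTF-8, so the two are indistinguishable until a special character appears. The sketch below illustrates this with made-up sample strings, and also shows that a program must read the whole file with one encoding.

```python
plain = "corpus linguistics"
# Pure ASCII text is byte-for-byte identical in ASCII and UTF-8.
same_bytes = plain.encode("ascii") == plain.encode("utf-8")
print(same_bytes)

mixed = "café"
utf8_bytes = mixed.encode("utf-8")
# Decoding with the encoding that was used to save recovers the text.
right = utf8_bytes.decode("utf-8")
# Decoding the same bytes as UTF-16 yields different, garbled characters.
wrong = utf8_bytes.decode("utf-16-be", errors="replace")
print(right == mixed, wrong == mixed)
```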
 