How do I perform operation on Chinese characters (UTF-8) in C?

If the input is something like 世界+你好, how can I perform the following UTF-8 (Unicode) operations in C?

  1. Split the string at the + character and place the two sections of Chinese characters into two separate arrays: str1 = 世界 and str2 = 你好.
  2. Compare the two arrays to see if the Chinese characters are the same.

The Chinese characters will be input from the terminal.

One of the nice things about UTF-8 is that if you find a byte that is a valid ASCII character (in particular, <128), that byte represents that ASCII character. Therefore, you can just split at the + character as if the string used a single-byte encoding.
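
For example, a minimal sketch of that split (the buffer names and sizes are just illustrative, and the input is assumed to be a NUL-terminated UTF-8 string as read from the terminal):

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        char input[256] = "世界+你好";   /* UTF-8 bytes, e.g. read from the terminal */
        char str1[256], str2[256];

        /* '+' is < 128, so this byte can only be the ASCII '+' itself,
           never part of a multi-byte Chinese character. */
        char *plus = strchr(input, '+');
        if (plus == NULL) {
            fprintf(stderr, "no '+' found\n");
            return 1;
        }

        size_t len1 = (size_t)(plus - input);
        memcpy(str1, input, len1);
        str1[len1] = '\0';                /* str1 = "世界" */
        strcpy(str2, plus + 1);           /* str2 = "你好" */

        printf("str1 = %s\nstr2 = %s\n", str1, str2);
        return 0;
    }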

Comparison in your case is also simply byte-wise. It gets much harder when you have to worry about canonical forms or case-sensitivity, but as far as I know, neither of those applies to Chinese. (Of course, you might have different characters you want to treat as identical, such as 気 and 氣. If so, normalize the strings first with a standard search-and-replace.)
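
A sketch of the exact-match comparison, again with made-up literals in place of terminal input:

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        const char *a = "你好";   /* UTF-8 bytes: e4 bd a0 e5 a5 bd */
        const char *b = "你好";
        const char *c = "世界";

        /* Identical characters have identical UTF-8 byte sequences,
           so strcmp (or memcmp) is enough for an exact-match test. */
        printf("a vs b: %s\n", strcmp(a, b) == 0 ? "same" : "different");
        printf("a vs c: %s\n", strcmp(a, c) == 0 ? "same" : "different");
        return 0;
    }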

I have worked with Chinese characters for many years, and I do not remember ever "performing operations in UTF-8". Let me explain: UTF-8 is an encoding; it is not meant to be used in memory for doing operations. When UTF-8 was invented, the idea was that English was the important language and the rest had to be supported somehow, so in UTF-8 English characters are first-class citizens, unlike Chinese.

As the word "encoding" implies, you must DECODE the data before you can use it. It is like "performing operations on characters in ZIP encoding". Of course, you can do something with the characters once you load the file into a buffer, but you will be decoding either way: either decode the whole buffer and then perform the operations, or decode on the fly, character by character, doing the operations at the same time.
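
To make the "decode on the fly" option concrete, here is a rough sketch; the helper below is my own illustration, it does no validation and assumes the input is well-formed UTF-8:

    #include <stdio.h>

    /* Decode one UTF-8 sequence starting at s, store the code point in *cp,
       and return the number of bytes consumed.  No validation: the input is
       assumed to be well-formed UTF-8. */
    static int utf8_decode(const unsigned char *s, unsigned int *cp)
    {
        if (s[0] < 0x80) {                 /* 1-byte sequence: plain ASCII   */
            *cp = s[0];
            return 1;
        } else if (s[0] < 0xE0) {          /* 2-byte sequence                */
            *cp = (s[0] & 0x1F) << 6 | (s[1] & 0x3F);
            return 2;
        } else if (s[0] < 0xF0) {          /* 3-byte sequence (most Chinese) */
            *cp = (s[0] & 0x0F) << 12 | (s[1] & 0x3F) << 6 | (s[2] & 0x3F);
            return 3;
        } else {                           /* 4-byte sequence                */
            *cp = (s[0] & 0x07) << 18 | (s[1] & 0x3F) << 12
                | (s[2] & 0x3F) << 6  | (s[3] & 0x3F);
            return 4;
        }
    }

    int main(void)
    {
        const unsigned char *p = (const unsigned char *)"世界+你好";
        while (*p) {
            unsigned int cp;
            p += utf8_decode(p, &cp);      /* decode on the fly, one character at a time */
            printf("U+%04X\n", cp);
        }
        return 0;
    }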

What exactly do I mean by "decoding"? Normally you will use the C type unsigned short or wchar_t , or sometimes int , to hold each character. So you load your UTF-8 text into a char utf8buffer[] buffer, then decode it into another buffer, wchar_t utf16buffer[] . Then you do whatever you need to do, and finally you encode back to UTF-8 and save to disk.
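
For example, with the standard library's conversion functions (a sketch that assumes the program runs under a UTF-8 locale; note that wchar_t is 32 bits wide on Linux/glibc and 16 bits on Windows, and the buffer sizes are just illustrative):

    #include <stdio.h>
    #include <stdlib.h>
    #include <locale.h>
    #include <wchar.h>

    int main(void)
    {
        setlocale(LC_ALL, "");                  /* pick up the terminal's (UTF-8) locale */

        char    utf8buffer[256] = "世界+你好";  /* encoded text, e.g. loaded from a file */
        wchar_t utf16buffer[256];

        /* DECODE: UTF-8 bytes -> one wchar_t per character */
        size_t n = mbstowcs(utf16buffer, utf8buffer, 256);
        if (n == (size_t)-1) {
            perror("mbstowcs");
            return 1;
        }

        /* ... operate on utf16buffer[0..n-1] here ... */

        /* ENCODE back to UTF-8 before saving to disk */
        char out[256];
        wcstombs(out, utf16buffer, sizeof out);
        printf("%s\n", out);
        return 0;
    }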

As you can see, UTF-16 is enough to deal with Chinese:

    L'一' == 0x4e00; // the first Chinese character "yi" - "one"
    L'龥' == 0x9fa5; // "yu" - the last character of the original CJK block.
                     // The Korean Hangul syllables start at 0xac00.

But this only applies to common Chinese; there are rare characters, used only by scholars of ancient literature, that will not fit into the 0xFFFF range. In fact the Chinese "alphabet" is not fixed at all: you can combine Chinese "radicals" and characters into new characters. Unicode even has a mechanism for this, called Ideographic Description Sequences ("IDS" for short). But hopefully you will not need that at all.
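
For example, 𠀀 (U+20000, the first character of CJK Extension B) already falls outside the 0xFFFF range; assuming a C11 compiler that accepts UTF-8 source and the u8/u/U string-literal prefixes, you can see how much room it takes in each encoding:

    #include <stdio.h>
    #include <uchar.h>

    int main(void)
    {
        /* 𠀀 is U+20000, outside the Basic Multilingual Plane. */
        printf("UTF-8 : %zu bytes\n",      sizeof(u8"𠀀") - 1);                    /* 4: F0 A0 80 80 */
        printf("UTF-16: %zu code units\n", sizeof(u"𠀀") / sizeof(char16_t) - 1);  /* 2: D840 DC00   */
        printf("UTF-32: %zu code unit\n",  sizeof(U"𠀀") / sizeof(char32_t) - 1);  /* 1: 00020000    */
        return 0;
    }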
