I only have experience with processing ASCII (single-byte characters) and have read a number of posts on how people process Unicode in different ways, each of which presents its own set of issues.
At this point, with my very limited exposure to Unicode, I've read that internal processing with UTF-16 presents portability and other issues.
I feel that UTF-32 makes more sense than UTF-16, since every Unicode code point fits within 4 bytes, but it would consume more resources, especially if you are mainly dealing with ISO-8859-1 characters.
I humbly feel that UTF-8 could be an ideal format to work with internally (especially for cases where you deal mainly with English and Latin-based characters), since the ASCII range of characters would be handled byte by byte very efficiently. Accented Latin characters (such as those in the upper half of ISO-8859-1) would consume two bytes, and other characters would consume up to four bytes.
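To make the byte counts concrete, here is a minimal sketch of a UTF-8 encoder. The name `encode_utf8` is a hypothetical helper for illustration, not a function from any library mentioned here, and it only follows the standard UTF-8 bit layout.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper: encode one Unicode code point as UTF-8.
// Returns an empty string for surrogates and for values above U+10FFFF.
std::string encode_utf8(std::uint32_t cp) {
    std::string out;
    if (cp < 0x80) {
        // ASCII range: one byte, identical to the ASCII value.
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        // Two bytes: covers accented Latin letters such as U+00E9.
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF) return out;  // surrogates are not encodable
        // Three bytes: the rest of the Basic Multilingual Plane.
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0x10FFFF) {
        // Four bytes: supplementary planes.
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

With this sketch, `encode_utf8(0x41)` ('A') is one byte long, `encode_utf8(0xE9)` ('é') is two bytes, `encode_utf8(0x20AC)` ('€') is three, and `encode_utf8(0x1F600)` is four.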
Another advantage that I see is that UTF-8 strings could be stored within regular C++ std::string or C string arrays, which seems so natural.
The disadvantage of using UTF-8, for me at least, is that I have not found any libraries that support UTF-8 internally. For example, I have not found any libraries for UTF-8 case conversion and substring operations.
Another disadvantage for me is that I have not found functions to parse bytes within a UTF-8 string for character processing.
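For what it's worth, walking the bytes of a UTF-8 string by hand is not much code. The following is a minimal sketch only: `decode_utf8` is a hypothetical name, and the function deliberately skips checks that a real library should perform (it does not reject overlong encodings or surrogates). Malformed input is replaced with U+FFFD.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: decode a UTF-8 byte string into code points.
// Malformed or truncated sequences yield U+FFFD. This sketch does NOT
// reject overlong encodings or surrogates.
std::vector<char32_t> decode_utf8(const std::string& s) {
    std::vector<char32_t> out;
    std::size_t i = 0;
    while (i < s.size()) {
        const unsigned char lead = static_cast<unsigned char>(s[i]);
        std::size_t len = 0;
        char32_t cp = 0;
        // The lead byte tells us how many bytes the sequence occupies.
        if (lead < 0x80)              { len = 1; cp = lead; }
        else if ((lead >> 5) == 0x06) { len = 2; cp = lead & 0x1F; }
        else if ((lead >> 4) == 0x0E) { len = 3; cp = lead & 0x0F; }
        else if ((lead >> 3) == 0x1E) { len = 4; cp = lead & 0x07; }
        else { out.push_back(0xFFFD); ++i; continue; }  // stray continuation or invalid lead

        if (i + len > s.size()) { out.push_back(0xFFFD); break; }  // truncated at end of input

        bool ok = true;
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char c = static_cast<unsigned char>(s[i + k]);
            if ((c & 0xC0) != 0x80) { ok = false; break; }  // continuation bytes look like 10xxxxxx
            cp = (cp << 6) | (c & 0x3F);
        }
        if (ok) { out.push_back(cp); i += len; }
        else    { out.push_back(0xFFFD); ++i; }
    }
    return out;
}
```

Calling `decode_utf8` on the five-byte string `"caf\xC3\xA9"` gives four code points, the last of which is U+00E9.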
Would it be feasible to work with UTF-8 internally, and are there any support libraries available for this purpose? I do hope so, but if not, I think that my best option would be to forget about using UTF-8 internally and use Boost.Locale, since I've read that ICU, which Boost.Locale can use as its backend, is a mature library used by many to handle Unicode.
I would really like to hear your opinions on this matter.
I bumped into my very old answer and I'll tell you what I ended up doing. I decided to stick with UTF-8 and store my data in std::string or single-byte char arrays. There was never a need for me to use multi-byte characters!
The first library that I used was UTF8-CPP, which is very easy to bring into your app and use. But you soon find that you need more and more capability.
I really wanted to avoid using ICU because it is such a large library, but once you build it and get it installed, you begin to wish that you had done it in the first place because it has everything you need and much, much more.
What are my benefits you may wonder:
Drawbacks:
When I looked at built-in language features, I found several things lacking, such as lower/upper case conversion, word boundaries, counting characters, accent sensitivity, and string manipulation such as substrings. Locale support is also totally amazing.
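Two of those gaps are easy to see with nothing but the standard library. The sketch below uses two hypothetical helpers, `count_code_points` and `byte_wise_upper`. It assumes the input is valid UTF-8 and that the program runs in the default "C" locale, where `std::toupper` only changes the ASCII letters a to z.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper: count code points by skipping continuation bytes
// (those of the form 10xxxxxx). Assumes the input is valid UTF-8.
std::size_t count_code_points(const std::string& s) {
    std::size_t n = 0;
    for (unsigned char c : s) {
        if ((c & 0xC0) != 0x80) ++n;
    }
    return n;
}

// Hypothetical helper: apply std::toupper to each byte. In the default
// "C" locale this uppercases ASCII letters and leaves every other byte
// unchanged, so non-ASCII letters are NOT converted.
std::string byte_wise_upper(std::string s) {
    for (char& c : s) {
        c = static_cast<char>(std::toupper(static_cast<unsigned char>(c)));
    }
    return s;
}
```

For the string `"caf\xC3\xA9"` ("café"), `size()` reports 5 bytes while `count_code_points` reports 4, and `byte_wise_upper` produces `"CAF\xC3\xA9"`: the accented letter is left in lower case. Even the code point count is not a count of user-perceived characters once combining marks are involved, which is where a full Unicode library earns its keep.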
I guess that summarizes my entire exercise in UTF-8.