
Working with UTF-8 vs UTF-16 vs UTF-32 internally within C++?

I only have experience processing ASCII (single-byte characters), and I have read a number of posts on the different ways people process Unicode, each of which presents its own set of issues.

At this point in my very limited exposure to Unicode, I've read that internal processing with UTF-16 presents portability and other issues.

I feel that UTF-32 makes more sense than UTF-16, since every Unicode code point fits within a single 4-byte unit, but it would consume more memory, especially if you are mainly dealing with ISO-8859-1 characters.

I humbly feel that UTF-8 could be an ideal format to work with internally (especially where you deal mainly with English and other Latin-based text), since the ASCII range of characters would be handled byte by byte very efficiently. Accented Latin characters (such as those in the upper half of ISO-8859-1) would consume two bytes, and other characters would consume three or four bytes.
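To make those byte counts concrete, here is a minimal sketch of the UTF-8 encoding rules. The function name `encode_utf8` is hypothetical (it does not come from any library), and the code assumes its input is a valid Unicode scalar value.

```cpp
#include <string>

// Hypothetical helper: encode one Unicode code point as UTF-8.
// Assumes cp is a valid scalar value (<= 0x10FFFF and not a surrogate);
// no validation is performed.
std::string encode_utf8(char32_t cp) {
    std::string out;
    if (cp < 0x80) {                  // ASCII: 1 byte
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {          // e.g. accented Latin letters: 2 bytes
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {        // rest of the Basic Multilingual Plane: 3 bytes
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                          // supplementary planes: 4 bytes
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

With this sketch, `encode_utf8(U'A')` is one byte long, U+00E9 (é) takes two bytes, U+20AC (€) takes three, and U+1F600 takes four.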

Another advantage that I see is that UTF-8 strings can be stored in a regular C++ std::string or a C char array, which seems so natural.

The disadvantage of using UTF-8, for me at least, is that I have not found any libraries that support UTF-8 internally. For example, I have not found any libraries for UTF-8 case conversion or substring operations.
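A substring measured in characters rather than bytes can be hand-rolled on top of std::string, at least as a rough sketch. The names `is_continuation` and `utf8_substr` are hypothetical, the input is assumed to be valid UTF-8, and "character" here means code point, which is not always what a user perceives as one character (combining marks are counted separately).

```cpp
#include <cstddef>
#include <string>

// True for UTF-8 continuation bytes (bit pattern 10xxxxxx).
inline bool is_continuation(unsigned char b) { return (b & 0xC0) == 0x80; }

// Hypothetical helper: substring measured in code points, not bytes.
// Assumes `s` holds valid UTF-8; no validation is performed.
std::string utf8_substr(const std::string& s, std::size_t start, std::size_t count) {
    std::size_t i = 0;

    // Advance past one code point: its lead byte plus any continuation bytes.
    auto skip_one = [&]() {
        ++i;
        while (i < s.size() && is_continuation(static_cast<unsigned char>(s[i]))) {
            ++i;
        }
    };

    // Skip the first `start` code points.
    for (std::size_t n = 0; i < s.size() && n < start; ++n) skip_one();
    const std::size_t begin = i;

    // Take up to `count` code points.
    for (std::size_t n = 0; i < s.size() && n < count; ++n) skip_one();

    return s.substr(begin, i - begin);
}
```

For the six-byte string "na\xC3\xAFve" (naïve), `utf8_substr(s, 1, 3)` yields the four bytes "a\xC3\xAFv", whereas a plain `s.substr(1, 3)` would cut the two-byte ï in half.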

Another disadvantage for me is that I have not found functions that parse the bytes of a UTF-8 string into individual characters for processing.
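For illustration, parsing the bytes by hand is not much code. The sketch below uses a hypothetical `decode_utf8` function that turns a UTF-8 std::string into a vector of code points. It substitutes U+FFFD for malformed sequences but deliberately does not reject overlong forms or surrogates, which a production decoder (or a library such as UTF8-CPP or ICU) should do.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: decode UTF-8 bytes into Unicode code points.
// Malformed sequences become U+FFFD. Overlong encodings and surrogates
// are NOT rejected in this sketch.
std::vector<char32_t> decode_utf8(const std::string& s) {
    std::vector<char32_t> out;
    std::size_t i = 0;
    while (i < s.size()) {
        const unsigned char b = static_cast<unsigned char>(s[i]);
        char32_t cp = 0;
        std::size_t extra = 0;  // number of continuation bytes expected

        if (b < 0x80)                { cp = b;        extra = 0; }  // 0xxxxxxx
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; extra = 1; }  // 110xxxxx
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; extra = 2; }  // 1110xxxx
        else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; extra = 3; }  // 11110xxx
        else {                       // stray continuation or invalid lead byte
            out.push_back(0xFFFD);
            ++i;
            continue;
        }
        ++i;

        bool ok = true;
        for (std::size_t k = 0; k < extra; ++k) {
            if (i >= s.size() ||
                (static_cast<unsigned char>(s[i]) & 0xC0) != 0x80) {
                ok = false;          // truncated or broken sequence
                break;
            }
            cp = (cp << 6) | (static_cast<unsigned char>(s[i]) & 0x3F);
            ++i;
        }
        out.push_back(ok ? cp : char32_t(0xFFFD));
    }
    return out;
}
```

Once the text is a sequence of code points, counting characters is just `decode_utf8(s).size()`, although anything locale-aware (case mapping, collation, word boundaries) still needs real Unicode data tables.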

Would it be feasible to work with UTF-8 internally, and are there any support libraries available for this purpose? I do hope so, but if not, I think my best option would be to forget about using UTF-8 internally and use Boost.Locale, which can use ICU as its backend, since I've read that ICU is a mature library used by many to handle Unicode.

I would really like to hear your opinions on this matter.

I bumped into my very old answer and I'll tell you what I ended up doing. I decided to stick with UTF-8 and store my data in std::string or single-byte char arrays. There was never a need for me to use wide character types!

The first library that I used was UTF8-CPP, which is very easy to bring into your app and use. But you soon find that you need more and more capability.

I really wanted to avoid using ICU because it is such a large library, but once you build it and get it installed, you begin to wish that you had done it in the first place because it has everything you need and much, much more.

What are the benefits, you may wonder?

  • I write truly portable code that builds under VC++ for Windows or GCC for Linux.
  • ICU has everything, and I mean everything, you need concerning Unicode.
  • I am able to stick with my beloved std::string and char arrays.
  • I use many open source libraries in my apps with zero issues. For example, I use RapidJSON to create in-memory JSON objects containing UTF-8 data. I'm able to pass them to a web server or write them to disk, etc. Really simple.
  • I store my data in Firebird SQL, but you need to specify UTF8 as the character set of your varchar and char fields. This means that your strings will be stored as multi-byte in the database. But this is totally transparent to you, the developer. I am certain that this applies to other SQL databases as well.

Drawbacks:

  • Large library, very scary and confusing at first.
  • The C++ was not written by C++ experts (like the Boost developers), so you may not like the syntax used. What I've done is to "wrap" common procedures with my own code. This pretty much means that I include my own UTF-8 library which wraps the ICU uglies. Don't let this bother you, because ICU is totally stable and fast.
  • I personally link ICU dynamically into my applications. This means that I first built ICU as shared libraries for my 64-bit Windows and Linux environments. In the case of Windows, I store the DLLs in a folder somewhere and add that folder to my Windows path so that any app that requires ICU can find them.

When I looked at the built-in language features, I found several things lacking, such as lower/upper case conversion, word boundaries, counting characters, accent sensitivity, and string manipulation such as substrings. ICU covers all of these, and its locale support is also totally amazing.
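As one small illustration of the case-conversion gap, here is a sketch that upper-cases a UTF-8 string byte by byte with the standard `std::toupper`. The function name `toupper_bytes` is hypothetical, and the behaviour described assumes the program runs in the default "C" locale.

```cpp
#include <cctype>
#include <string>

// Hypothetical helper: byte-wise upper-casing with the C library.
// In the default "C" locale only the ASCII letters a-z are changed;
// the bytes of multi-byte UTF-8 sequences pass through untouched.
std::string toupper_bytes(std::string s) {
    for (char& c : s) {
        // Cast to unsigned char first: passing a negative char value
        // to std::toupper is undefined behaviour.
        c = static_cast<char>(std::toupper(static_cast<unsigned char>(c)));
    }
    return s;
}
```

Given "caf\xC3\xA9" (café), this returns "CAF\xC3\xA9": the é stays lower case because it is stored as two bytes that the byte-wise function does not recognise as a letter. Full Unicode case mapping can even change the length of a string (ß upper-cases to SS), which is why a library with real Unicode tables is needed for this.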

I guess that summarizes my entire exercise in UTF-8.


 