Forcing wchar_t to be 4 bytes

Original post: 2014-01-16 18:41:40 · Tags: c, utf-8, utf, wchar-t

Practical question: I'm working on a small app that runs on two separate hardware platforms.

The compilation method and its configuration are defined and controlled by me.

My app receives UTF-8/ISO-8859 text and should perform some basic manipulation on the string (copying, searching, etc.).

The thing is, one compiler is GCC (sizeof(wchar_t) == 4) and the other is MinGW (sizeof(wchar_t) == 2).

In order to support all UTF-8 possibilities, I was thinking of typedef'ing wchar_t in my code as uint32_t, which would force the MinGW compiler into line with GCC and cover all UTF-8 options.

I'm then planning to use the wide-character manipulation functions provided by the standard library (mbstowcs, wcscmp, wcscpy, etc.).

The question is: could "forcing" the compiler to use more room have some bad impact (besides performance) on how the library functions? Will mbstowcs even work after the change?

I tried using ICU, but it is a very large library, which is a deal-breaker. I need something small and reliable.

Thanks

1 answer

Here are your options for string manipulation:

Use unsigned char (or char) and UTF-8. All the regular string manipulation functions work (like strlen(), strstr(), snprintf(), etc.).
Use wchar_t and use a different encoding on different platforms (Win32 uses UTF-16, OS X and Linux use UTF-32). This is a path of madness, since you have to support two different encodings in the same code base.
Use UTF-32 or UTF-16 and your own string manipulation functions. This is a lot of work, but it is portable.
Use ICU and UTF-16.
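
As a rough illustration of the first option, here is a minimal sketch of searching UTF-8 text with plain strstr(). The helper name utf8_find is hypothetical, and the sketch assumes both strings are valid, NUL-terminated UTF-8. The search is byte-wise, but because lead bytes and continuation bytes occupy disjoint ranges in UTF-8, a match for a valid needle always starts on a character boundary.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not part of any library): returns the byte offset
 * of the first occurrence of needle in haystack, or -1 if it is absent.
 * Assumes both arguments are valid, NUL-terminated UTF-8. */
long utf8_find(const char *haystack, const char *needle)
{
    assert(haystack != NULL && needle != NULL);

    /* strstr() compares bytes; no wchar_t or decoding is involved. */
    const char *hit = strstr(haystack, needle);
    return hit ? (long)(hit - haystack) : -1;
}
```

For example, searching "na\xC3\xAFve caf\xC3\xA9" (the UTF-8 bytes of "naïve café") for "caf\xC3\xA9" gives byte offset 7, not 6, because "ï" occupies two bytes. The result is a code unit index, which is what the other str* functions expect.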

For most purposes, manipulating strings in UTF-8 works very well. It depends on what your program does. If you are doing things like parsing and templating, UTF-8 is easy to work with. If you need more sophisticated functionality, such as iterating over break points or finding grapheme cluster boundaries, then you will need a library like Glib (which uses UTF-8) or ICU (which uses UTF-16).
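
To make the parsing case concrete, here is a small sketch that splits a UTF-8 line on an ASCII delimiter. The function utf8_field and its signature are made up for illustration. It relies on one property of UTF-8: bytes below 0x80 never occur inside a multi-byte sequence, so strchr() on an ASCII delimiter cannot cut a character in half.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copies the index-th delimiter-separated field of a
 * UTF-8 line into out. Returns the field length in bytes, or -1 if the
 * field does not exist or does not fit in out (including the NUL). */
long utf8_field(const char *line, char delim, size_t index,
                char *out, size_t out_size)
{
    assert(line != NULL && out != NULL);
    /* The delimiter must be a non-NUL ASCII byte for this to be safe. */
    assert(delim != '\0' && (unsigned char)delim < 0x80);

    const char *start = line;
    for (;;) {
        const char *end = strchr(start, delim);
        size_t len = end ? (size_t)(end - start) : strlen(start);

        if (index == 0) {
            if (len >= out_size)
                return -1;          /* no room for the field plus NUL */
            memcpy(out, start, len);
            out[len] = '\0';
            return (long)len;
        }
        if (end == NULL)
            return -1;              /* ran out of fields */
        start = end + 1;            /* skip past the delimiter */
        index--;
    }
}
```

Called on "name=Zo\xC3\xAB;city=K\xC3\xB6ln" with delimiter ';' and index 1, this copies "city=K\xC3\xB6ln" ("city=Köln") and returns 10, the length in bytes.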

A note about indexes

You may be used to indexing strings using character / code point indexes. Get used to indexing strings using code unit indexes: so strlen() returns the number of bytes, not the number of characters. However, it is very rare to actually need to index a string by character position.
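
If a code point count is ever needed, it can be derived from the bytes. The sketch below uses a hypothetical helper, utf8_code_points, and assumes valid, NUL-terminated UTF-8; it counts every byte that is not a continuation byte (continuation bytes have the bit pattern 10xxxxxx).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: counts code points in a valid, NUL-terminated
 * UTF-8 string by skipping continuation bytes (10xxxxxx). */
size_t utf8_code_points(const char *s)
{
    assert(s != NULL);

    size_t n = 0;
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;                    /* lead byte or ASCII byte */
    }
    return n;
}
```

For "h\xC3\xA9llo" (the UTF-8 bytes of "héllo"), strlen() gives 6 while utf8_code_points() gives 5. Note that code points are still not user-perceived characters; grapheme clusters need a library, as mentioned above.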