
How can I audit my Windows application for correct Unicode handling?

Asked 2011-06-20 15:42:08. Tags: c++, winapi, unicode

I can't use prepackaged Unicode string libraries, such as ICU, because they blow up the size of the binary to an insane degree (it's a 200k program; ICU is 16MB+!).

I'm already using the built-in wchar_t string type for everything, but I want to ensure I'm not doing anything stupid when iterating over strings, or things like that.

Are there tools that do for Unicode what fuzzers do for security? That is, tools that throw characters outside of the Basic Multilingual Plane at my code and ensure things get handled correctly as UTF-16?

(Oh, and obviously a cross-platform solution works, though most cross-platform things would have to support both UTF-8 and UTF-16.)

EDIT: Also note things that are less obvious than UTF-16 surrogate pairs -- things like accent marks!

2 answers

Some things to check:

Make sure that instead of handling WM_CHAR you're handling WM_UNICHAR:

The WM_UNICHAR message is the same as WM_CHAR, except it uses UTF-32. It is designed to send or post Unicode characters to ANSI windows, and it can handle Unicode Supplementary Plane characters.
Do not assume that the i-th character is at index i. It obviously isn't, and if you rely on that assumption for, say, breaking a string in half, you could split a surrogate pair and corrupt the string.
Don't tell the user (in a status bar or something) that the text has N characters just because the character array has length N.
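
The last two points can be sketched roughly as follows. This is only an illustration, not a complete solution: the helper names (`countCodePoints`, `safeSplitIndex`) are made up for this example, and it uses `std::u16string` so that it compiles anywhere; on Windows, where wchar_t is 16 bits wide, the same logic would apply to `std::wstring`.

```cpp
#include <cstddef>
#include <string>

// True if u is the first (high) half of a surrogate pair.
inline bool isHighSurrogate(char16_t u) { return u >= 0xD800 && u <= 0xDBFF; }

// True if u is the second (low) half of a surrogate pair.
inline bool isLowSurrogate(char16_t u) { return u >= 0xDC00 && u <= 0xDFFF; }

// Counts code points rather than 16-bit units. A well-formed surrogate
// pair counts once; an unpaired surrogate counts as one unit on its own.
std::size_t countCodePoints(const std::u16string& s) {
    std::size_t n = 0;
    for (std::size_t i = 0; i < s.size(); ++i, ++n) {
        if (isHighSurrogate(s[i]) && i + 1 < s.size() && isLowSurrogate(s[i + 1]))
            ++i;  // skip the low half of the pair
    }
    return n;
}

// Moves a proposed split index back by one if it would fall between the
// two halves of a surrogate pair; otherwise returns it unchanged.
std::size_t safeSplitIndex(const std::u16string& s, std::size_t i) {
    if (i > 0 && i < s.size() && isLowSurrogate(s[i]) && isHighSurrogate(s[i - 1]))
        return i - 1;
    return i;
}
```

Note that even a code point count is not a count of user-perceived characters: a base letter followed by a combining accent mark is two code points but is displayed as one character, so this sketch only covers the surrogate-pair half of the problem.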

The wrong answer

Use WM_UNICHAR, it handles UTF-32 and can handle Unicode Supplementary Plane characters.

While this is almost true, the complete truth looks like this:

WM_UNICHAR is a hack designed to let ANSI windows receive Unicode characters. Create a Unicode window and you will never receive it.
Create an ANSI window and you will be surprised that it still doesn't work as expected. The catch is that when the window is created, you receive a WM_UNICHAR with 0xffff, to which you must react by returning 1 (the default window procedure will return 0). Fail to do this, and you will never see a WM_UNICHAR again. Good job that the official documentation doesn't tell you that.
Run your program on a system that, for mysterious reasons, doesn't support WM_UNICHAR (such as my 64-bit Windows 7 system) and it still won't work, even if you do everything correctly.
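
For illustration, the handling described above might look roughly like this. It is a sketch under assumptions, not real window-procedure code: `onUniChar` and `appendUtf16` are hypothetical helpers, and the constant is redefined locally (with the same value as `UNICODE_NOCHAR` in the Windows SDK) so that the example compiles without <windows.h>. In a real program the logic would sit in the `WM_UNICHAR` case of the window procedure.

```cpp
#include <cstdint>
#include <string>

// Same value as UNICODE_NOCHAR in the Windows SDK; redefined here only so
// the sketch is self-contained.
constexpr std::uint32_t kUnicodeNoChar = 0xFFFF;

// Appends the UTF-16 encoding of a UTF-32 code point to out.
// Returns false for surrogate values and for values above U+10FFFF.
bool appendUtf16(std::uint32_t cp, std::u16string& out) {
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) return false;
    if (cp < 0x10000) {
        out.push_back(static_cast<char16_t>(cp));
    } else {
        cp -= 0x10000;
        out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));    // high half
        out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));  // low half
    }
    return true;
}

// Hypothetical helper for the WM_UNICHAR case of a window procedure.
// wParam is the UTF-32 code point; the return value is what the window
// procedure should return.
long onUniChar(std::uint32_t wParam, std::u16string& text) {
    if (wParam == kUnicodeNoChar)
        return 1;  // the probe: answer 1 to announce WM_UNICHAR support
    appendUtf16(wParam, text);  // a real character: store it as UTF-16
    return 0;
}
```

Even with this in place, the caveat from the last point still applies: on a system that never sends WM_UNICHAR, the handler is simply never reached.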

The theoretically correct answer

There is nothing to audit or to pay attention to.

Compile with UNICODE defined, or explicitly create your window class as well as your window using a "W" function, and use WM_CHAR as if this was the most natural thing to do. That's it. It is indeed the most natural thing.

WM_CHAR uses UTF-16 (except when it doesn't, such as under Windows 2000). Of course, a single UTF-16 code unit cannot represent code points outside the BMP, but that is not a problem because you simply get two WM_CHAR messages containing a surrogate pair. It's entirely transparent to your application; you do not need to do anything special. Any Windows API function that accepts a wide character string will happily accept these surrogates, too.
The only thing to be aware of is that now the character length of a string (obviously) is no longer simply the number of 16-bit words. But that was a wrong assumption to begin with, anyway.
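
If the application does need whole code points rather than raw UTF-16 units (for a character count, say), the two WM_CHAR messages can be recombined. A minimal sketch, assuming the 16-bit wParam of each WM_CHAR is fed in order; `SurrogateAssembler` is a made-up name, and the handling of stray halves shown here is just one possible policy:

```cpp
#include <cstdint>

// Combines successive UTF-16 code units (for example, the wParam values of
// consecutive WM_CHAR messages) into full code points.
class SurrogateAssembler {
public:
    // Feeds one code unit. Returns true and sets cp when a complete code
    // point is available; returns false while waiting for the low half.
    // A stray low half is reported as U+FFFD; a stray high half is dropped.
    bool feed(char16_t unit, char32_t& cp) {
        const bool isHigh = unit >= 0xD800 && unit <= 0xDBFF;
        const bool isLow = unit >= 0xDC00 && unit <= 0xDFFF;
        if (isLow) {
            if (pendingHigh_ == 0) {
                cp = 0xFFFD;  // low half without a preceding high half
                return true;
            }
            cp = 0x10000
               + ((static_cast<char32_t>(pendingHigh_) - 0xD800) << 10)
               + (unit - 0xDC00);
            pendingHigh_ = 0;
            return true;
        }
        pendingHigh_ = 0;  // any unfinished high half is discarded here
        if (isHigh) {
            pendingHigh_ = unit;  // remember it and wait for the low half
            return false;
        }
        cp = unit;  // ordinary BMP character
        return true;
    }

private:
    char16_t pendingHigh_ = 0;  // 0 means no high half is pending
};
```

For plain text insertion none of this is required: passing both halves on unchanged, in order, to a wide-character API is enough.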

The sad truth

In reality, on many (most? all?) systems, you just get a single WM_CHAR message with wParam containing the low 16 bits of the key code. Which is mighty fine for anything within the BMP, but sucks otherwise.

I have verified this both by using Alt-keypad codes and by creating a custom keyboard layout which generates code points outside the BMP. In either case, only a single WM_CHAR is received, containing the lower 16 bits of the character. The upper 16 bits are simply thrown away.

In order for your program to work 100% correctly with Unicode, you must apparently use the input method manager (ImmGetCompositionStringW), which is a nuisance and badly documented. For me, personally, this simply means: "OK, screw that". But if you are interested in being 100% correct, look at the source code of any editor using Scintilla (link to line), which does just that and works perfectly.