How can I audit my Windows application for correct Unicode handling?
I can't use prepackaged Unicode string libraries, such as ICU, because they blow up the size of the binary to an insane degree (it's a 200k program; ICU is 16MB+!).
I'm already using the built-in wchar_t string type for everything, but I want to ensure I'm not doing anything stupid when iterating over strings, or things like that.
Are there tools that do for Unicode what fuzzers do for security? That is, tools that throw characters outside of the Basic Multilingual Plane at my code and ensure things get handled correctly as UTF-16?
(Oh, and obviously a cross-platform solution works, though most cross-platform things would have to support both UTF-8 and UTF-16.)
EDIT: Also note things that are less obvious than UTF-16 surrogate pairs, such as accent marks!
Some things to check:
Make sure that instead of handling WM_CHAR you're handling WM_UNICHAR:
The WM_UNICHAR message is the same as WM_CHAR, except it uses UTF-32. It is designed to send or post Unicode characters to ANSI windows, and it can handle Unicode Supplementary Plane characters.
Do not assume that the i-th character is at index i. It obviously isn't, and if you rely on that assumption for, say, breaking a string in half, you could end up cutting a character in two.
Don't tell the user (in a status bar or something) that they have N characters just because the character array has length N.
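A minimal sketch of the last two checks, assuming the text is held as UTF-16 code units. The helper names (count_code_points, safe_split_index) are made up for illustration, and char16_t stands in for the 16-bit wchar_t on Windows so the code behaves the same elsewhere. It only deals with surrogate pairs; a base letter followed by a combining accent mark would still be counted as two code points.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helpers for illustration; not part of any Windows API.
inline bool is_high_surrogate(char16_t u) { return u >= 0xD800 && u <= 0xDBFF; }
inline bool is_low_surrogate(char16_t u)  { return u >= 0xDC00 && u <= 0xDFFF; }

// Counts code points: a well-formed surrogate pair counts once.
// An unpaired surrogate is counted as one code point on its own.
inline std::size_t count_code_points(const std::u16string& s) {
    std::size_t n = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        if (is_high_surrogate(s[i]) && i + 1 < s.size() && is_low_surrogate(s[i + 1])) {
            ++i;  // skip the low half of the pair
        }
        ++n;
    }
    return n;
}

// Returns an index <= pos at which the string can be cut without separating
// a high surrogate from the low surrogate that follows it.
// Precondition: pos <= s.size().
inline std::size_t safe_split_index(const std::u16string& s, std::size_t pos) {
    assert(pos <= s.size());
    if (pos == 0 || pos == s.size()) return pos;
    if (is_low_surrogate(s[pos]) && is_high_surrogate(s[pos - 1])) return pos - 1;
    return pos;
}
```

For a string such as u"a\U0001F600b", size() reports 4 code units while count_code_points reports 3, and asking to split at index 2 moves the cut back to index 1 so the pair stays together.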
Use WM_UNICHAR, it handles UTF-32 and can handle Unicode Supplementary Plane characters.
While this is almost true, the complete truth looks like this:
WM_UNICHAR is a hack designed for ANSI windows to receive Unicode characters. Create a Unicode window and you will never receive it.
You do receive a WM_UNICHAR with 0xffff, to which you must react by returning 1 (the default window procedure will return 0). Fail to do this, and you will never see a WM_UNICHAR again. Good job that the official documentation doesn't tell you that.
On some systems (such as my Windows 7 64 system) it still won't work, even if you do everything correctly.
There is nothing to audit or to pay attention to.
Compile with UNICODE defined, or explicitly create your window class as well as your window using a "W" function, and use WM_CHAR as if this were the most natural thing to do. That's it. It is indeed the most natural thing.
WM_CHAR uses UTF-16 (except when it doesn't, such as under Windows 2000). Of course, a single UTF-16 code unit cannot represent code points outside the BMP, but that is not a problem because you simply get two WM_CHAR messages containing a surrogate pair. It's entirely transparent to your application; you do not need to do anything special. Any Windows API function that accepts a wide character string will happily accept these surrogates, too.
The only thing to be aware of is that now the character length of a string (obviously) is no longer simply the number of 16-bit words. But that was a wrong assumption to begin with, anyway.
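If a supplementary-plane character does arrive as two WM_CHAR messages, and the application wants the actual code point (for validation or counting, say), the two halves have to be joined. The class below is a rough sketch of that, not a Windows API: Utf16Assembler and feed are hypothetical names, and the decision to report a stray low surrogate as U+FFFD and to drop a stray high surrogate is an assumption made for this example.

```cpp
#include <cassert>

// Hypothetical helper, not a Windows API. Feed it one UTF-16 code unit at a
// time, for example the low 16 bits of wParam from each WM_CHAR message.
class Utf16Assembler {
public:
    // Returns true and stores a code point in `out` when `unit` completes one.
    // Returns false when `unit` is a high surrogate waiting for its low half.
    // An unpaired low surrogate yields U+FFFD; an unpaired high surrogate is
    // dropped when the next unit turns out not to be a low surrogate.
    bool feed(char16_t unit, char32_t& out) {
        const bool is_high = unit >= 0xD800 && unit <= 0xDBFF;
        const bool is_low  = unit >= 0xDC00 && unit <= 0xDFFF;
        if (is_low) {
            if (pending_high_ == 0) {
                out = 0xFFFD;  // low half without a preceding high half
                return true;
            }
            out = 0x10000
                + ((static_cast<char32_t>(pending_high_) - 0xD800) << 10)
                + (static_cast<char32_t>(unit) - 0xDC00);
            pending_high_ = 0;
            return true;
        }
        pending_high_ = 0;  // any other unit discards a stale high surrogate
        if (is_high) {
            pending_high_ = unit;
            return false;
        }
        out = unit;  // ordinary BMP character
        return true;
    }

private:
    char16_t pending_high_ = 0;  // 0 means "no high surrogate stored"
};

// Usage sketch: U+1F600 arrives as the two units 0xD83D and 0xDE00.
inline void assembler_example() {
    Utf16Assembler a;
    char32_t cp = 0;
    assert(!a.feed(0xD83D, cp));
    assert(a.feed(0xDE00, cp) && cp == 0x1F600);
}
```

In a window procedure, one such object per window would be fed from the WM_CHAR case; code that only appends the units to a wide string does not need it at all.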
In reality, on many (most? all?) systems, you just get a single WM_CHAR message with wParam containing the low 16 bits of the character code. That is perfectly fine for anything within the BMP, but sucks otherwise.
I have verified this both by using Alt-keypad codes and by creating a custom keyboard layout which generates code points outside the BMP. In either case, only a single WM_CHAR is received, containing the lower 16 bits of the character. The upper 16 bits are simply thrown away.
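To make the loss concrete, here is a small sketch that contrasts the truncation described above with a proper UTF-16 encoding of the same code point. Both function names are invented for this example, and char16_t again stands in for a 16-bit wchar_t.

```cpp
#include <cassert>
#include <string>

// What the observed behaviour amounts to: only the low 16 bits survive.
inline char16_t truncate_to_16_bits(char32_t cp) {
    return static_cast<char16_t>(cp & 0xFFFF);
}

// What a correct UTF-16 encoding of the same code point looks like.
// Precondition: cp is a Unicode scalar value (<= 0x10FFFF, not a surrogate).
inline std::u16string encode_utf16(char32_t cp) {
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    std::u16string out;
    if (cp < 0x10000) {
        out.push_back(static_cast<char16_t>(cp));  // BMP: one code unit
    } else {
        const char32_t v = cp - 0x10000;  // 20 significant bits remain
        out.push_back(static_cast<char16_t>(0xD800 + (v >> 10)));    // high surrogate
        out.push_back(static_cast<char16_t>(0xDC00 + (v & 0x3FF)));  // low surrogate
    }
    return out;
}
```

For U+1F600, truncation gives 0xF600, which is an unrelated Private Use Area code point, whereas the correct encoding is the pair 0xD83D 0xDE00. A truncated value cannot be told apart from a genuine BMP character after the fact, so the information is lost for good.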
In order for your program to work 100% correctly with Unicode, you must apparently use the input method manager (ImmGetCompositionStringW), which is a nuisance and badly documented. For me, personally, this simply means: "OK, screw that". But if you are interested in being 100% correct, look at the source code of any editor using Scintilla, which does just that and works perfectly.