Unicode string normalization in C/C++

I am wondering how to normalize strings (containing UTF-8/UTF-16) in C/C++. In .NET there is a function for this, String.Normalize.

I have used UTF8-CPP in the past, but it does not provide such a function. ICU and Qt provide string normalization, but I would prefer a lightweight solution.

Is there any "lightweight" solution for this?

As I wrote in another question, utf8proc is a very nice, lightweight library for basic Unicode functionality, including Unicode string normalization.

For Windows, there is the NormalizeString() function (unfortunately only on Vista and later, as far as I can see on MSDN):

http://msdn.microsoft.com/en-us/library/windows/desktop/dd319093%28v=vs.85%29.aspx

It's the simplest approach I have found so far. I guess it's quite lightweight too.

int NormalizeString(
    _In_      NORM_FORM NormForm,
    _In_      LPCWSTR   lpSrcString,
    _In_      int       cwSrcLength,
    _Out_opt_ LPWSTR    lpDstString,
    _In_      int       cwDstLength
);

You could build ICU with minimal data (or possibly no other data at all; I think all of the normalization data is now internal) and then link it statically. I haven't tried this recently, but I believe the total size is pretty small in that case.

A good UTF-8 solution is glib's g_utf8_normalize() function. If you need this for std::wstring too, you would have to convert std::wstring to std::string (UTF-16 to UTF-8) first, which makes it quite an expensive solution. For that reason I am looking for a better solution myself, if possible using pure C++(11) means.

"Lightweight" in your context means "with limited functionality". I would use the ICU source as an example and refer to http://unicode.org/reports/tr15/ to implement this "lightweight" functionality.
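To give an idea of what such a hand-rolled implementation involves, here is a rough sketch of the two core steps of NFD as described in TR15: recursive canonical decomposition, followed by canonical reordering of combining marks. Everything in it is an assumption made for illustration: the function names are invented, it works on UTF-32 code points rather than UTF-8/UTF-16, and the two lookup functions cover only a handful of characters. A real implementation would generate those tables from the Unicode Character Database and would also need Hangul handling and, for NFC, a composition pass.

```cpp
#include <algorithm>
#include <string>

// Toy excerpt of the canonical decomposition mappings (illustrative only).
// Returns an empty string when the code point has no mapping in this table.
static std::u32string decomposition_of(char32_t cp) {
    switch (cp) {
        case 0x00E9: return {0x0065, 0x0301};  // e-acute -> e + acute
        case 0x1E63: return {0x0073, 0x0323};  // s-dot-below -> s + dot below
        case 0x1E69: return {0x1E63, 0x0307};  // -> s-dot-below + dot above
        default:     return {};
    }
}

// Toy excerpt of the canonical combining classes (0 means "starter").
static int combining_class(char32_t cp) {
    switch (cp) {
        case 0x0323: return 220;  // combining dot below
        case 0x0301:              // combining acute accent
        case 0x0307: return 230;  // combining dot above
        default:     return 0;
    }
}

// Step 1: apply decomposition mappings recursively until none applies.
static void decompose(char32_t cp, std::u32string& out) {
    const std::u32string mapping = decomposition_of(cp);
    if (mapping.empty()) {
        out.push_back(cp);
        return;
    }
    for (char32_t part : mapping) {
        decompose(part, out);
    }
}

// Hypothetical entry point: NFD restricted to the toy tables above.
std::u32string toy_nfd(const std::u32string& input) {
    std::u32string out;
    for (char32_t cp : input) {
        decompose(cp, out);
    }
    // Step 2: within each run of non-starters, sort by combining class.
    // The sort must be stable so marks of equal class keep their order.
    auto it = out.begin();
    while (it != out.end()) {
        if (combining_class(*it) == 0) {
            ++it;
            continue;
        }
        auto run_end = it;
        while (run_end != out.end() && combining_class(*run_end) != 0) {
            ++run_end;
        }
        std::stable_sort(it, run_end, [](char32_t a, char32_t b) {
            return combining_class(a) < combining_class(b);
        });
        it = run_end;
    }
    return out;
}
```

For instance, both the precomposed U+1E69 and the sequence U+0073 U+0307 U+0323 come out as U+0073 U+0323 U+0307, because the dot below (class 220) is ordered before the dot above (class 230). The code itself is short; most of the size of a full implementation is in the data tables, which is where the "limited functionality" trade-off shows up.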

