
How to convert multi-byte UTF-8 character representation to one byte while retaining (non)alphanumeric property?

Original 2011-03-11 17:22:13 5 3 c/ unicode/ utf-8

I have a UTF-8 string as a char*. To get one byte per character (and thus random access into the string by character index), I currently just remove all UTF-8 continuation bytes from it. I would like to avoid a "proper" conversion to a fixed-width representation.
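For readers unfamiliar with the trick described above, here is a minimal sketch of what stripping continuation bytes could look like. The function name strip_continuation_bytes is hypothetical and not taken from the question; the sketch assumes a NUL-terminated input and a destination buffer at least as large as the source.

```c
#include <assert.h>
#include <stddef.h>

/* Copy src to dst, dropping every UTF-8 continuation byte (10xxxxxx).
 * dst must have room for strlen(src) + 1 bytes. Returns the number of
 * bytes kept, which equals the character count if src is valid UTF-8.
 * The function name is illustrative only. */
size_t strip_continuation_bytes(const char *src, char *dst)
{
    assert(src != NULL && dst != NULL);
    size_t n = 0;
    for (; *src != '\0'; ++src) {
        unsigned char b = (unsigned char)*src;
        if ((b & 0xC0) != 0x80)      /* keep ASCII and head bytes only */
            dst[n++] = (char)b;
    }
    dst[n] = '\0';
    return n;
}
```

After this pass every character occupies exactly one byte, but a non-ASCII character is reduced to its head byte, which says nothing about whether the character was alphanumeric.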

Instead of removing all continuation bytes, I would like to check whether a given multi-byte UTF-8 character is alphanumeric and then replace it with a corresponding ASCII character (say, 'a' for alphanumerics and '.' otherwise). How do I do this?

3 answers

There's no way to do this in general, as letters outside the ASCII range (such as α) may be accented as well (ἄ). But you can apply the NFD Unicode normalization to decompose accented code points into their constituents, then check whether the components lie within the ASCII range. ICU has normalization support.

For each byte in the string:

If it is an ASCII byte, just copy it.
If it is a UTF-8 head byte, decode starting from that byte to wchar_t using mbrtowc, determine an ASCII character whose classification matches by comparing the results of the isw*() functions, and copy that ASCII character to the output.
If it is anything else, skip it.
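The steps above can be sketched roughly as follows. This is only an illustration under a few assumptions: mbrtowc decodes according to the current LC_CTYPE locale, so a UTF-8 locale must be selected first; the locale names tried in use_utf8_locale are guesses about what is installed; and the classification is reduced to iswalnum, mapping to 'a' or '.' as in the question. Both function names are hypothetical.

```c
#include <assert.h>
#include <locale.h>
#include <string.h>
#include <wchar.h>
#include <wctype.h>

/* Try a few common UTF-8 locale names (an assumption about the system;
 * adjust for yours). Returns 1 on success, 0 if none is available. */
int use_utf8_locale(void)
{
    static const char *const names[] = { "C.UTF-8", "en_US.UTF-8" };
    for (size_t i = 0; i < sizeof names / sizeof names[0]; ++i)
        if (setlocale(LC_CTYPE, names[i]) != NULL)
            return 1;
    return 0;
}

/* Rewrite UTF-8 text in src as one byte per character in dst:
 * ASCII bytes are copied, each multi-byte character becomes 'a' if
 * iswalnum() classifies it as alphanumeric and '.' otherwise, and any
 * other byte is skipped. Requires a UTF-8 locale to be active already.
 * dst needs strlen(src) + 1 bytes. Returns the output length. */
size_t flatten_utf8(const char *src, char *dst)
{
    assert(src != NULL && dst != NULL);
    size_t len = strlen(src), i = 0, n = 0;
    mbstate_t st;
    memset(&st, 0, sizeof st);

    while (i < len) {
        unsigned char b = (unsigned char)src[i];
        if (b < 0x80) {                       /* ASCII: copy as is */
            dst[n++] = (char)b;
            i++;
        } else if ((b & 0xC0) == 0xC0) {      /* head byte: decode */
            wchar_t wc;
            size_t k = mbrtowc(&wc, src + i, len - i, &st);
            if (k == (size_t)-1 || k == (size_t)-2 || k == 0) {
                memset(&st, 0, sizeof st);    /* invalid: reset state, skip byte */
                i++;
            } else {
                dst[n++] = iswalnum((wint_t)wc) ? 'a' : '.';
                i += k;
            }
        } else {                              /* stray continuation byte */
            i++;
        }
    }
    dst[n] = '\0';
    return n;
}
```

A caller would first check that use_utf8_locale() returned 1 and then call flatten_utf8(input, output). Finer distinctions (iswdigit, iswspace, iswpunct, and so on) could be mapped to other ASCII characters in the same place where iswalnum is tested.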

The highest Unicode code point is 1114111 (0x10FFFF), which means there is room for over a million characters. A single byte can represent only 256 values.

So the simple answer is that you can't do it that way.

As far as I understand from the question, you want this for random access to characters in the string. For that you can use 32-bit characters. (Correct me if I am wrong.)

Rather than handling it with your own code, use ICU and convert the string to UTF-32 (4 bytes per character) with one of its converters. ucnv_convertEx is the function to use for this.
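To show what a fixed-width representation buys you, here is a small hand-rolled decoder. It is not the ICU converter this answer recommends, only a dependency-free sketch of the same idea: each character becomes one 32-bit code point, so indexing by character is plain array indexing. The function name utf8_to_utf32 is hypothetical, and the decoder is simplified (it does not reject overlong encodings or surrogates).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decode NUL-terminated UTF-8 into 32-bit code points. Writes at most
 * cap code points to out and returns how many were written, or
 * (size_t)-1 on malformed input. Simplified sketch: overlong forms and
 * surrogate code points are not rejected. */
size_t utf8_to_utf32(const char *src, uint32_t *out, size_t cap)
{
    assert(src != NULL && out != NULL);
    const unsigned char *s = (const unsigned char *)src;
    size_t n = 0;
    while (*s != 0 && n < cap) {
        uint32_t cp;
        int extra;                       /* continuation bytes expected */
        if (*s < 0x80)                { cp = *s;        extra = 0; }
        else if ((*s & 0xE0) == 0xC0) { cp = *s & 0x1F; extra = 1; }
        else if ((*s & 0xF0) == 0xE0) { cp = *s & 0x0F; extra = 2; }
        else if ((*s & 0xF8) == 0xF0) { cp = *s & 0x07; extra = 3; }
        else return (size_t)-1;          /* stray continuation or bad head byte */
        ++s;
        while (extra-- > 0) {
            if ((*s & 0xC0) != 0x80)     /* also catches a premature NUL */
                return (size_t)-1;
            cp = (cp << 6) | (uint32_t)(*s & 0x3F);
            ++s;
        }
        out[n++] = cp;
    }
    return n;
}
```

Once decoded, out[i] is the i-th character, and it can be classified with iswalnum where wchar_t holds Unicode code points, or with ICU's character property functions. The cost is four bytes per character instead of one.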