
Is there a way to check whether a string contains Unicode characters in C++

Original: 2014-12-17 09:39:15 6 2 c++/visual-c++


I have a string and I need to validate whether it contains Unicode (UTF-8 or UTF-16) characters. If it does, I need to convert them to ASCII. I have some idea about the conversion logic, but need some help detecting the Unicode characters in the string.

2 answers

You cannot tell in full generality.

A string is just a sequence of characters (which could be of any size). The encoding, inextricably associated with such a sequence, attaches textual meaning to the string.

In Windows, the encoding used is UTF-16, which does allow you to take an educated guess. Windows provides the API function IsTextUnicode, which can help. But do take note that there's no guarantee it will work.

There's no 100% guaranteed solution. I'd start by reading the first 100 or so bytes, and try to determine the encoding:

If the file starts with the three-byte sequence 0xEF, 0xBB, 0xBF, it's probably UTF-8. In this case, drop these three, and process the rest as UTF-8, below.
If the file starts with the two-byte sequence 0xFE, 0xFF, it's probably UTF16BE. Drop these two, and process the rest as UTF16BE, below.
If the file starts with the two-byte sequence 0xFF, 0xFE, it's probably UTF16LE. Drop these two, and process the rest as UTF16LE, below.
If every other byte, starting with the first, is mostly 0, then the file is probably UTF16BE. (How many counts as "mostly" depends; depending on the source of the data, even more than a couple could be sufficient.) Process as UTF16BE, below.
If every other byte, starting with the second, is mostly 0, then it's probably UTF16LE (very frequent in the Windows world). Process as UTF16LE, below.
Otherwise, it's anyone's guess, but processing it as if it were UTF-8 (without dropping any bytes) is probably acceptable.
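
The detection steps above can be sketched roughly as follows. This is only an illustration, not a definitive implementation: the names `Encoding`, `Guess` and `guessEncoding` are made up for the example, and the threshold for "mostly 0" (more than half of the sampled positions) is an arbitrary assumption that should be tuned to the data source.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

// Hypothetical names, introduced only for this sketch.
enum class Encoding { Utf8, Utf16BE, Utf16LE };

struct Guess {
    Encoding encoding;    // best guess at the encoding
    std::size_t bomSize;  // leading bytes to drop (0 if no byte order mark)
};

// Guess the encoding by looking at (at most) the first sampleSize bytes.
Guess guessEncoding(const std::string& data, std::size_t sampleSize = 100)
{
    auto byteAt = [&](std::size_t i) {
        return static_cast<unsigned char>(data[i]);
    };

    // Step 1: look for a byte order mark.
    if (data.size() >= 3 && byteAt(0) == 0xEF && byteAt(1) == 0xBB && byteAt(2) == 0xBF)
        return {Encoding::Utf8, 3};
    if (data.size() >= 2 && byteAt(0) == 0xFE && byteAt(1) == 0xFF)
        return {Encoding::Utf16BE, 2};
    if (data.size() >= 2 && byteAt(0) == 0xFF && byteAt(1) == 0xFE)
        return {Encoding::Utf16LE, 2};

    // Step 2: count zero bytes at even and odd offsets in the sample.
    const std::size_t n = std::min(data.size(), sampleSize);
    std::size_t evenZeros = 0, oddZeros = 0;
    for (std::size_t i = 0; i < n; ++i) {
        if (byteAt(i) != 0) continue;
        if (i % 2 == 0) ++evenZeros; else ++oddZeros;
    }

    // Assumption: "mostly 0" means more than half of the sampled positions.
    const std::size_t evenCount = (n + 1) / 2;
    const std::size_t oddCount = n / 2;
    if (evenZeros * 2 > evenCount && evenZeros > oddZeros)
        return {Encoding::Utf16BE, 0};
    if (oddZeros * 2 > oddCount && oddZeros > evenZeros)
        return {Encoding::Utf16LE, 0};

    // Step 3: anyone's guess; treat it as UTF-8 without dropping anything.
    return {Encoding::Utf8, 0};
}
```

A caller would then drop the first `bomSize` bytes, for example with `data.substr(guess.bomSize)`, before processing the rest.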

As for how to process the file:

For UTF-8, just check that all of the remaining bytes are in the range [0,128). If they aren't, the file can't be converted to ASCII. If they are, the file is ASCII (as well as being UTF-8). This is also valid for most single-byte encodings, e.g. all of the ISO-8859 encodings (which are still widespread).
For UTF16BE, every other byte, starting at the first, should be 0, and the remaining bytes should be in the range [0,128). If they aren't, the file can't be converted to ASCII. If they are, take every other byte, starting at the second.
For UTF16LE, every other byte, starting at the second, should be 0, and the remaining bytes should be in the range [0,128). If they aren't, the file can't be converted to ASCII. If they are, take every other byte, starting at the first.

In all cases, this processing starts after dropping any byte order mark bytes identified in the first step.
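
A minimal sketch of these three processing rules might look like the following. The `Encoding` enum and the `toAscii` function are hypothetical names introduced for the example (the enum is repeated here so the snippet stands alone), and rejecting UTF-16 input with an odd number of bytes is an extra assumption not stated above.

```cpp
#include <cstddef>
#include <string>

// Hypothetical names, introduced only for this sketch.
enum class Encoding { Utf8, Utf16BE, Utf16LE };

// Try to convert `data` (with any byte order mark already dropped) to ASCII.
// Returns false, leaving `out` untouched, if the text is not pure ASCII.
bool toAscii(const std::string& data, Encoding encoding, std::string& out)
{
    auto isAscii = [](char c) { return static_cast<unsigned char>(c) < 128; };
    std::string result;

    if (encoding == Encoding::Utf8) {
        // Every byte must be in [0,128); the data is then already ASCII.
        for (char c : data) {
            if (!isAscii(c)) return false;
            result += c;
        }
        out = result;
        return true;
    }

    // Assumption: a UTF-16 byte stream with an odd length is rejected.
    if (data.size() % 2 != 0) return false;

    // Offset of the byte that must be 0 within each two-byte unit.
    const std::size_t high = (encoding == Encoding::Utf16BE) ? 0 : 1;
    const std::size_t low = 1 - high;
    for (std::size_t i = 0; i < data.size(); i += 2) {
        if (data[i + high] != 0 || !isAscii(data[i + low])) return false;
        result += data[i + low];  // keep only the ASCII byte
    }
    out = result;
    return true;
}
```

For example, the four bytes `h 0 i 0` processed as `Encoding::Utf16LE` yield `hi`, while any input containing é is rejected.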

Finally, you don't say what you are trying to do. There are encoding conventions which allow representing all Unicode characters in pure ASCII; if the ASCII you generate will be processed by code expecting one of these conventions, then you'll have to process the full Unicode (including surrogate pairs in the UTF-16) and convert the Unicode to whatever encoding the target program expects. C++, for example, expects universal character names; the representation for é would be \u00E9, which means you'd also have to convert \ to \\. (As far as I know, this convention only applies to programming languages, like C, C++ and Java.)
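
As an illustration of the universal character name convention, here is a rough sketch that turns UTF-16 code units into pure ASCII. The function name `toUniversalCharacterNames` is invented for the example, and the handling of unpaired surrogates (passed through as \uXXXX) is a simplifying assumption; real code would probably want to reject them.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

// Hypothetical helper: encode UTF-16 text as ASCII using \uXXXX and
// \UXXXXXXXX escapes, doubling any backslash already present.
std::string toUniversalCharacterNames(const std::u16string& text)
{
    std::string result;
    for (std::size_t i = 0; i < text.size(); ++i) {
        char32_t cp = text[i];

        // Combine a high surrogate with the low surrogate that follows it.
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < text.size()
            && text[i + 1] >= 0xDC00 && text[i + 1] <= 0xDFFF) {
            cp = 0x10000 + ((cp - 0xD800) << 10) + (text[i + 1] - 0xDC00);
            ++i;  // the low surrogate has been consumed
        }

        if (cp == U'\\') {
            result += "\\\\";                 // \ becomes \\ in the output
        } else if (cp < 0x80) {
            result += static_cast<char>(cp);  // ASCII passes through
        } else {
            // Assumption: unpaired surrogates also end up here unchanged.
            char buffer[11];
            if (cp <= 0xFFFF)
                std::snprintf(buffer, sizeof buffer, "\\u%04X", static_cast<unsigned>(cp));
            else
                std::snprintf(buffer, sizeof buffer, "\\U%08X", static_cast<unsigned>(cp));
            result += buffer;
        }
    }
    return result;
}
```

Characters outside the Basic Multilingual Plane need the eight-digit \U form, which is why the surrogate pair has to be decoded before the escape is written.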