
Is there a way to check whether a string contains unicode characters in C++


I have a string and I need to validate whether it contains Unicode (UTF-8 or UTF-16) characters. If it does, I need to convert them to ASCII. I have some idea about the conversion logic, but need some help detecting the Unicode characters in the string.

You cannot tell in full generality.

A string is just a sequence of characters (which could be of any size). The encoding, inextricably associated with such a sequence, is what attaches textual meaning to the string.

On Windows, the encoding used is UTF-16, which does allow you to make an educated guess. The API provides the function IsTextUnicode, which can help. But do take note that there's no guarantee it will work.

There's no 100% guaranteed solution. I'd start by reading the first 100 or so bytes, and try to determine the encoding:

  • If the file starts with the three byte sequence 0xEF, 0xBB, 0xBF, it's probably UTF-8. In this case, drop these three, and process the rest as UTF-8, below.

  • If the file starts with the two byte sequence 0xFE, 0xFF, it's probably UTF16BE. Drop these two, and process the rest as UTF16BE, below.

  • If the file starts with the two byte sequence 0xFF, 0xFE, it's probably UTF16LE. Drop these two, and process the rest as UTF16LE, below.

  • If every other byte, starting with the first, is mostly 0, then the file is probably UTF16BE. (How many count as "mostly" depends on the source of the data; sometimes even a couple of zero bytes could be sufficient.) Process as UTF16BE, below.

  • If every other byte, starting with the second, is mostly 0, then it's probably UTF16LE (very frequent in the Windows world). Process as UTF16LE, below.

  • Otherwise, it's anyone's guess, but processing it as if it were UTF-8 (without dropping any bytes) is probably acceptable.
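The detection steps above could be sketched as follows. This is a minimal illustration, not a robust detector; the names (`Guess`, `guessEncoding`) and the "more than a quarter of the sampled bytes are 0" threshold are my own illustrative choices, not part of the answer:

```cpp
#include <cstddef>
#include <string>

// Illustrative sketch of the heuristic described above: check for a
// BOM first, then fall back to counting zero bytes at even and odd
// offsets in roughly the first 100 bytes.
enum class Guess { Utf8, Utf16BE, Utf16LE };

Guess guessEncoding(const std::string& data, std::size_t* bomSize) {
    *bomSize = 0;
    // Work with unsigned values to avoid sign issues on plain char.
    auto b = [&](std::size_t i) { return static_cast<unsigned char>(data[i]); };

    // BOM checks.
    if (data.size() >= 3 && b(0) == 0xEF && b(1) == 0xBB && b(2) == 0xBF) {
        *bomSize = 3;
        return Guess::Utf8;
    }
    if (data.size() >= 2 && b(0) == 0xFE && b(1) == 0xFF) {
        *bomSize = 2;
        return Guess::Utf16BE;
    }
    if (data.size() >= 2 && b(0) == 0xFF && b(1) == 0xFE) {
        *bomSize = 2;
        return Guess::Utf16LE;
    }

    // No BOM: count zero bytes at even and odd offsets.  "Mostly 0" is
    // taken here, arbitrarily, as more than a quarter of the sample.
    std::size_t limit = data.size() < 100 ? data.size() : 100;
    std::size_t evenZeros = 0, oddZeros = 0;
    for (std::size_t i = 0; i < limit; ++i) {
        if (b(i) == 0) (i % 2 == 0 ? evenZeros : oddZeros) += 1;
    }
    if (evenZeros > limit / 4) return Guess::Utf16BE;
    if (oddZeros > limit / 4) return Guess::Utf16LE;
    return Guess::Utf8;   // default guess: process as UTF-8
}
```

The returned `bomSize` tells the caller how many leading bytes to drop before the processing step described next.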

As for how to process the file:

  • For UTF-8, just check that all of the remaining bytes are in the range [0,128). If they aren't, the file can't be converted to ASCII. If they are, the file is ASCII (as well as being UTF-8). This is also valid for most single-byte encodings, e.g. all of the ISO-8859 encodings (which are still widespread).

  • For UTF16BE, every other byte, starting at the first, should be 0, and the remaining bytes in the range [0,128). If they aren't, the file can't be converted to ASCII. If they are, take every other byte, starting at the second.

  • For UTF16LE, every other byte, starting at the second, should be 0, and the remaining bytes in the range [0,128). If they aren't, the file can't be converted to ASCII. If they are, take every other byte, starting at the first.

In all cases, this processing starts after dropping any bytes from the first step.
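A sketch of that processing step, applied after any BOM has been dropped. The function names are illustrative; `utf16ToAscii` takes a `zeroOffset` of 0 for UTF16BE (zero/high byte first) and 1 for UTF16LE (zero/high byte second):

```cpp
#include <cstddef>
#include <optional>
#include <string>

// UTF-8 case: the data is convertible iff every byte is already in
// [0,128), in which case it *is* the ASCII result.
std::optional<std::string> utf8ToAscii(const std::string& data) {
    for (unsigned char c : data)
        if (c >= 128) return std::nullopt;
    return data;
}

// UTF-16 cases: every other byte (at zeroOffset within each pair) must
// be 0 and the remaining byte in [0,128); keep the non-zero bytes.
std::optional<std::string> utf16ToAscii(const std::string& data, int zeroOffset) {
    if (data.size() % 2 != 0) return std::nullopt;   // not valid UTF-16
    std::string result;
    for (std::size_t i = 0; i < data.size(); i += 2) {
        unsigned char hi = data[i + zeroOffset];
        unsigned char lo = data[i + 1 - zeroOffset];
        if (hi != 0 || lo >= 128) return std::nullopt;
        result += static_cast<char>(lo);
    }
    return result;
}
```

An empty `std::optional` signals "can't be converted to ASCII", mirroring the prose above.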

Finally, you don't say what you are trying to do. There are encoding conventions which allow representing all Unicode characters in pure ASCII; if the ASCII you generate will be processed by code expecting one of these conventions, then you'll have to process the full Unicode (including surrogate pairs in the UTF-16) and convert it to whatever encoding the target program expects. C++, for example, expects universal character names; the representation for é, for example, would be \u00E9. Which means you'd also have to convert \ to \\. (As far as I know, this convention only applies to programming languages, like C, C++ and Java.)
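As a sketch of that last convention, already-decoded code points (decoding from UTF-8/UTF-16, including surrogate pairs, is not shown) could be escaped as universal character names like this; the function name is hypothetical:

```cpp
#include <cstdio>
#include <string>

// Illustrative sketch: render a sequence of Unicode code points as
// pure-ASCII C++ source text using universal character names.
std::string toUniversalCharacterNames(const std::u32string& codePoints) {
    std::string out;
    char buf[16];
    for (char32_t cp : codePoints) {
        if (cp == U'\\') {
            out += "\\\\";                       // double literal backslashes
        } else if (cp < 128) {
            out += static_cast<char>(cp);        // plain ASCII passes through
        } else if (cp <= 0xFFFF) {               // \uXXXX for the BMP
            std::snprintf(buf, sizeof buf, "\\u%04X", static_cast<unsigned>(cp));
            out += buf;
        } else {                                 // \UXXXXXXXX beyond the BMP
            std::snprintf(buf, sizeof buf, "\\U%08X", static_cast<unsigned>(cp));
            out += buf;
        }
    }
    return out;
}
```

So é (U+00E9) comes out as \u00E9, and a literal backslash as \\.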
