
Check if a char* buffer contains UTF8 characters?

Is there a quick and dirty way to check whether a char* buffer contains UTF-8 characters when there is no BOM?

You can test the hypothesis that the buffer holds UTF-8, but I believe the only thing you can establish with certainty is that it does not. In other words, you can examine the buffer to see whether all byte sequences are legal UTF-8, whether every code point is represented with the least number of bytes, whether any UTF-16 surrogate code points are present, and so forth. A buffer that passes all of those criteria might seem to be text, but you could be fooled.
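The checks listed above can be sketched in C roughly as follows. This is a minimal illustration, not a reference implementation: the function name `is_valid_utf8` and its signature are assumptions made for this example, and the rules it applies (shortest-form encodings only, no surrogates, nothing above U+10FFFF) follow the usual definition of well-formed UTF-8.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: returns true if buf[0..len) is well-formed UTF-8.
 * Rejects overlong forms, surrogates (U+D800..U+DFFF), code points above
 * U+10FFFF, stray continuation bytes, and truncated sequences. */
bool is_valid_utf8(const char *buf, size_t len)
{
    const unsigned char *p = (const unsigned char *)buf;
    size_t i = 0;

    while (i < len) {
        unsigned char b = p[i];
        size_t need;        /* continuation bytes expected after the lead byte */
        unsigned long cp;   /* code point being decoded */
        unsigned long min;  /* smallest code point allowed for this length */

        if (b < 0x80) {                 /* plain 7-bit ASCII */
            i++;
            continue;
        } else if ((b & 0xE0) == 0xC0) {
            need = 1; cp = b & 0x1F; min = 0x80;
        } else if ((b & 0xF0) == 0xE0) {
            need = 2; cp = b & 0x0F; min = 0x800;
        } else if ((b & 0xF8) == 0xF0) {
            need = 3; cp = b & 0x07; min = 0x10000;
        } else {
            return false;               /* stray continuation or invalid lead byte */
        }

        if (len - i - 1 < need)
            return false;               /* sequence runs past the end of the buffer */

        for (size_t k = 1; k <= need; k++) {
            unsigned char c = p[i + k];
            if ((c & 0xC0) != 0x80)
                return false;           /* not a continuation byte */
            cp = (cp << 6) | (c & 0x3F);
        }

        if (cp < min)
            return false;               /* overlong (non-shortest) encoding */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return false;               /* UTF-16 surrogate */
        if (cp > 0x10FFFF)
            return false;               /* beyond the Unicode range */

        i += need + 1;
    }
    return true;
}
```

A call such as `is_valid_utf8(buf, n)` returning false tells you the buffer is not UTF-8. A return of true only means the bytes are consistent with UTF-8; a pure-ASCII buffer, for instance, passes this check and is equally valid in many other encodings.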

In addition to the Raymond Chen discussion at Old New Thing cited by Mark Pim's answer, the buffer could actually contain x86 machine code that just happens to be restricted to the subset that is 7-bit printable ASCII. Amazingly, you can write meaningful programs in that subset, one example of which is the EICAR anti-virus test file.

Of course, a buffer that contains byte sequences that are malformed UTF-8 is probably not UTF-8 text at all. In that case, you have a high degree of confidence. Then the trick is to figure out what encoding it might actually be.

If you know (or can assume) something about the semantic content of the buffer, then you could also use that to support your determination. For example, if the buffer is supposed to contain English text, then it is highly unlikely to have codepoints from Korean in it, and it should generally be spelled correctly, follow English grammar, and so forth. This can get expensive to test, of course...

Not reliably. See Raymond Chen's series of posts on the subject.

The problem is that UTF-8 without a BOM is all too often indistinguishable from equally valid text in an ANSI (legacy 8-bit code page) encoding. I think most solutions (like the Win32 API IsTextUnicode) use various heuristics to give a best guess at the format of the text.

For quick and dirty, you can't do much better than the regex on this page. If you just want to know whether it's safe to decode the bytes as UTF-8, that's all you need.

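Simply test that the byte sequence is valid as UTF-8. If it is, and it contains at least one multi-byte sequence, the probability of it being meaningful text in any other encoding is essentially zero. A buffer that is pure 7-bit ASCII is valid UTF-8 and valid in most legacy encodings alike, so the distinction does not matter there.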

