
Check if a char* buffer contains UTF8 characters?

Original post 2009-08-05 08:36:15 8 4 c++/ c/ utf-8

Without a BOM, is there a quick and dirty way to check whether a char* buffer contains UTF-8 characters?

4 answers

You can test the hypothesis that the buffer is UTF-8, but I believe the only thing you can ever know with certainty is that it is not. In other words, you can examine the buffer to check that every byte sequence is legal UTF-8, that each code point is encoded in the fewest possible bytes, that no UTF-16 surrogate code points are present, and so forth. A buffer that passes all of those checks might appear to be UTF-8 text, but you could still be fooled.
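The checks listed above (legal byte sequences, shortest-form encoding, no surrogates) can be sketched in C roughly as follows. This is a minimal sketch rather than a reference implementation: the function name is_valid_utf8 is made up for illustration, the caller is assumed to know the buffer length, and the byte ranges follow the well-formed sequence table in RFC 3629.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: returns true if buf[0..len) is well-formed UTF-8.
 * Rejects overlong encodings, surrogates (U+D800..U+DFFF) and anything
 * above U+10FFFF. An empty buffer counts as valid. */
bool is_valid_utf8(const char *buf, size_t len)
{
    const unsigned char *p = (const unsigned char *)buf;
    size_t i = 0;

    while (i < len) {
        unsigned char b0 = p[i];
        size_t need;               /* continuation bytes expected */
        unsigned char lo = 0x80;   /* allowed range of the first */
        unsigned char hi = 0xBF;   /* continuation byte           */

        if (b0 <= 0x7F) {          /* plain ASCII */
            i++;
            continue;
        } else if (b0 >= 0xC2 && b0 <= 0xDF) {
            need = 1;              /* 0xC0 and 0xC1 would be overlong */
        } else if (b0 >= 0xE0 && b0 <= 0xEF) {
            need = 2;
            if (b0 == 0xE0) lo = 0xA0;   /* overlong 3-byte forms */
            if (b0 == 0xED) hi = 0x9F;   /* UTF-16 surrogates */
        } else if (b0 >= 0xF0 && b0 <= 0xF4) {
            need = 3;
            if (b0 == 0xF0) lo = 0x90;   /* overlong 4-byte forms */
            if (b0 == 0xF4) hi = 0x8F;   /* beyond U+10FFFF */
        } else {
            return false;          /* stray continuation or invalid lead byte */
        }

        if (len - i - 1 < need)    /* sequence truncated at end of buffer */
            return false;
        if (p[i + 1] < lo || p[i + 1] > hi)
            return false;
        for (size_t k = 2; k <= need; k++) {
            if ((p[i + k] & 0xC0) != 0x80)
                return false;
        }
        i += need + 1;
    }
    return true;
}
```

A false result means the buffer is definitely not UTF-8. A true result only means the hypothesis has not been ruled out.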

In addition to the Raymond Chen discussion at Old New Thing cited by Mark Pim's answer, the buffer could actually contain x86 machine code that just happens to be restricted to the subset that looks like 7-bit printable ASCII. Amazingly, you actually can write meaningful programs in that subset; one example is the EICAR anti-virus test file.

Of course, a buffer that contains malformed UTF-8 byte sequences is probably not UTF-8 text at all, and in that case you can be highly confident of the conclusion. Then the trick is to figure out what encoding it might actually be.

If you know (or can assume) something about the semantic content of the buffer, then you could also use that to support your determination. For example, if the buffer is supposed to contain English text, then it is highly unlikely to contain Korean code points, and it should generally be spelled correctly, follow English grammar, and so forth. This can get expensive to test, of course...

Not reliably. See Raymond Chen's series of posts on the subject.

The problem is that UTF-8 without a BOM is all too often indistinguishable from an equally valid ANSI encoding. I think most solutions (like the Win32 API IsTextUnicode) use various heuristics to give a best guess at the format of the text.

For quick and dirty, you can't do much better than the regex on this page. If you just want to know whether it's safe to decode the bytes as UTF-8, that's all you need.

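Simply test that the byte sequence is valid as UTF-8. If it is, and it contains at least some non-ASCII bytes, the probability of it being meaningful text in any other encoding is essentially zero. (A pure ASCII buffer is valid UTF-8 too, but it is equally valid in most other 8-bit encodings.)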
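How much a successful validity test tells you depends on whether the buffer contains any bytes with the high bit set, since a buffer of pure 7-bit ASCII passes a UTF-8 check trivially. A small sketch of that extra test, with has_non_ascii as a hypothetical name and the buffer length assumed to be known to the caller:

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: returns true if any byte in buf[0..len) is
 * outside the 7-bit ASCII range, i.e. has its high bit set. */
bool has_non_ascii(const char *buf, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        if ((unsigned char)buf[i] >= 0x80)
            return true;
    }
    return false;
}
```

If this returns false, the buffer is ASCII and the choice between UTF-8 and a legacy 8-bit encoding makes no difference for decoding it. If it returns true and the buffer is also well-formed UTF-8, that is fairly strong evidence for UTF-8.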

Trimming UTF8 buffer

c++ check utf8 string contain specified characters

Check for invalid UTF8

UTF8 vs Wide Char?

Are there delimiter bytes for UTF8 characters?

fastcgipp < no output for utf8 characters

Convert UTF8 encoded byte buffer to wstring?

UTF8 char to hex value string

Handling the utf8 encoded char* array

UTF8 char array to std::wstring


The technical posts on this site are licensed under CC BY-SA 4.0. If you reprint them, please cite the site URL or the original address. For any questions, contact yoyou2525@163.com.


solution1
6 2009-08-05 08:47:48

solution2
4 ACCEPTED 2009-08-05 08:41:31

solution3
0 2009-08-05 09:26:21

solution4
0 2011-05-24 02:39:43
