
Detecting Unicode in files in Windows 10

Windows 10 Notepad no longer requires Unicode files to have a BOM header, and it no longer writes one by default. This breaks our existing code, which checks the header to determine whether a file is Unicode. How can I now tell in C++ if a file is Unicode? Source: https://www.bleepingcomputer.com/news/microsoft/windows-10-notepad-is-getting-better-utf-8-encoding-support/

The code we currently use to determine Unicode:

#include <windows.h> // for the BYTE typedef

// Returns 0 for no recognized BOM, 1 for UTF-8, 2 for UTF-16 (BE), 3 for UTF-16 (LE).
int IsUnicode(const BYTE p2bytes[3])
{
    if (p2bytes[0]==0xEF && p2bytes[1]==0xBB && p2bytes[2]==0xBF)
        return 1; // UTF-8
    if (p2bytes[0]==0xFE && p2bytes[1]==0xFF)
        return 2; // UTF-16 (BE)
    if (p2bytes[0]==0xFF && p2bytes[1]==0xFE)
        return 3; // UTF-16 (LE); note FF FE is also the start of a UTF-32 (LE) BOM

    return 0;
}
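
For completeness, a minimal sketch of how such a check is typically driven, reading the first three bytes of the file (the helper name DetectBom and the error handling are illustrative, not part of the original code):

#include <cstdio>

// Read the first three bytes of a file and classify them with
// IsUnicode above. Returns -1 if the file cannot be opened.
int DetectBom(const char* path)
{
    BYTE header[3] = {0, 0, 0};    // zero-filled in case the file is shorter
    FILE* f = std::fopen(path, "rb");
    if (!f)
        return -1;
    std::fread(header, 1, 3, f);
    std::fclose(f);
    return IsUnicode(header);
}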

If it is this much pain, why isn't there a standard function to determine the encoding?

You should use the W3C method, which goes something like this:

  • if you know the encoding, use that

  • if there is a BOM, use it to determine the encoding

  • decode as UTF-8. UTF-8 has strict byte sequence rules (which is the whole point of UTF-8: being able to find the first byte of a character from any position). So if the file is not UTF-8, the decoding will very probably fail: in ANSI (cp-1252) text it is rare for an accented letter to be followed by a symbol, and extremely improbable for that to happen at every such sequence; in Latin-1 the continuation bytes would decode as C1 control characters (instead of symbols), and it is just as unlikely that a C1 control character follows every accented letter. (See the validation sketch after this list.)

  • if decoding fails (maybe you can just test the first 4096 bytes, or the first 10 bytes with values above 127), use the standard 8-bit encoding of the OS (probably cp-1252 on Windows).
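
A minimal sketch of such a strict UTF-8 check, written out by hand so the byte-sequence rules are visible (the function name is illustrative; any real decoder that rejects malformed input works just as well):

#include <cstddef>

// Returns true if the buffer is well-formed UTF-8: checks lead-byte
// patterns, continuation bytes, overlong forms, surrogates and the
// U+10FFFF upper bound.
bool IsValidUtf8(const unsigned char* p, size_t n)
{
    size_t i = 0;
    while (i < n) {
        unsigned char b = p[i];
        if (b < 0x80) { ++i; continue; }        // plain ASCII byte
        size_t len;
        unsigned long cp;
        if ((b & 0xE0) == 0xC0)      { len = 2; cp = b & 0x1F; }
        else if ((b & 0xF0) == 0xE0) { len = 3; cp = b & 0x0F; }
        else if ((b & 0xF8) == 0xF0) { len = 4; cp = b & 0x07; }
        else return false;                      // stray continuation or invalid lead byte
        if (i + len > n) return false;          // truncated sequence
        for (size_t k = 1; k < len; ++k) {
            if ((p[i + k] & 0xC0) != 0x80) return false; // not a continuation byte
            cp = (cp << 6) | (p[i + k] & 0x3F);
        }
        // Reject overlong encodings, UTF-16 surrogates and values past U+10FFFF.
        if ((len == 2 && cp < 0x80) || (len == 3 && cp < 0x800) ||
            (len == 4 && cp < 0x10000) ||
            (cp >= 0xD800 && cp <= 0xDFFF) || cp > 0x10FFFF)
            return false;
        i += len;
    }
    return true;
}

Running this over the first few kilobytes of the file, before falling back to the ANSI code page, reproduces the decision order of the list above.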

This method should work very well. It is biased toward UTF-8, but the world moved in that direction long ago. Determining which specific code page an 8-bit file uses is much more difficult.

You may add a step before the last one. If there are many 00 bytes, you may have a UTF-16 or UTF-32 form. Unicode requires that you know which form is in use (e.g. from a side channel); otherwise the file should have a BOM. But you can guess the form (UTF-16LE, UTF-16BE, UTF-32LE, UTF-32BE) from the positions of the 00 bytes in the file: newlines and some other ASCII characters belong to the Common script, so they are shared by many scripts, which means you should find many 00 bytes at predictable positions.
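
A rough sketch of that guess for UTF-16, assuming the text is mostly ASCII-range characters (so every code unit has one 00 byte); the function name and thresholds are illustrative:

#include <cstddef>

// Guess UTF-16 endianness from the positions of 0x00 bytes.
// Returns 0 (probably not UTF-16), 2 (UTF-16 BE) or 3 (UTF-16 LE),
// reusing the return codes of IsUnicode above.
int GuessUtf16(const unsigned char* p, size_t n)
{
    size_t zeroEven = 0, zeroOdd = 0;
    for (size_t i = 0; i < n; ++i) {
        if (p[i] == 0x00) {
            if (i % 2 == 0) ++zeroEven; else ++zeroOdd;
        }
    }
    // ASCII text encoded as UTF-16BE puts the 0x00 at even offsets,
    // UTF-16LE at odd offsets.
    if (zeroEven > n / 4 && zeroEven > 4 * zeroOdd)
        return 2;
    if (zeroOdd > n / 4 && zeroOdd > 4 * zeroEven)
        return 3;
    return 0;
}

The same counting idea extends to UTF-32 with a stride of four bytes instead of two.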

Now Windows 10 does not require Unicode files to have the BOM header.

Windows never had this requirement. Every program is free to read text files however it wants.

Perhaps interesting: a BOM is often undesirable for UTF-8 because it breaks ASCII compatibility.

This breaks the existing code that checks the header to determine whether a file is Unicode.

This is a misunderstanding. Much other code has had Unicode support for far longer than Windows Notepad.

How can I now tell in C++ if a file is Unicode?

Typically you would first check for the presence of a BOM and use that information.

Next you can try to read (the beginning of) the file with all possible encodings. The ones that fail to decode are obviously not suitable.
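
On Windows, one concrete way to run that probe for UTF-8 is MultiByteToWideChar with the MB_ERR_INVALID_CHARS flag, which makes the conversion report failure on malformed input (a sketch; the function name is mine and the buffer handling is kept minimal):

#include <windows.h>

// Returns true if the buffer converts cleanly as UTF-8.
// Passing a null output buffer only asks for the required length,
// so nothing is actually converted.
bool DecodesAsUtf8(const char* data, int size)
{
    int wlen = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                                   data, size, nullptr, 0);
    return wlen != 0; // 0 means the conversion failed
}

Note that an empty buffer also returns 0 here, so a real caller may want to handle that case separately.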

From the remaining candidates, you can use a heuristic to pick the most likely encoding.

And if the guess still turns out wrong, give the user an option to change the encoding manually. That is how many editors, such as Notepad++, handle it.
