
Detecting Unicode in files in Windows 10

Windows 10 Notepad no longer requires Unicode files to have a BOM header, and by default it no longer writes one. This breaks existing code that checks the header to detect Unicode in files. How can I now tell in C++ whether a file is in Unicode? Source: https://www.bleepingcomputer.com/news/microsoft/windows-10-notepad-is-getting-better-utf-8-encoding-support/

The code we currently use to detect Unicode:

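#include <windows.h> // for BYTE

// Compares the first bytes of a file against the known BOM signatures.
// The caller must supply at least 3 bytes.
int IsUnicode(const BYTE p2bytes[3])
{
        if( p2bytes[0]==0xEF && p2bytes[1]==0xBB && p2bytes[2]==0xBF)
            return 1; // UTF-8
        if( p2bytes[0]==0xFE && p2bytes[1]==0xFF)
            return 2; // UTF-16 (BE)
        if( p2bytes[0]==0xFF && p2bytes[1]==0xFE)
            return 3; // UTF-16 (LE)

        return 0; // no BOM found
}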

If this is such a pain, why isn't there a standard function to determine the encoding?

You should use the W3C method, which is something like:

  • if you know the encoding, use that

  • if there is a BOM, use it to determine the encoding

  • decode as UTF-8. UTF-8 has strict byte sequence rules (that is the point of its design: being able to find the first byte of a character). So if the file is not UTF-8, decoding will very probably fail. In ANSI (cp-1252) text it is uncommon for an accented letter to be followed by a symbol, and very improbable that this happens every single time. In Latin-1 you may get control characters instead of symbols, but it is just as rare for C1 control characters to appear only after accented letters, and after every accented letter.

  • if decoding fails (maybe you can just test the first 4096 bytes, or the first 10 bytes above 127), use the standard 8-bit encoding of the OS (probably cp-1252 on Windows).

This method should work very well. It is biased toward UTF-8, but the world moved in that direction long ago. Determining which code page is in use is much more difficult.
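The steps above can be sketched roughly as follows. This is a minimal illustration, not a reference implementation: the names `GuessedEncoding`, `IsValidUtf8` and `GuessEncoding` are hypothetical, the "known encoding" step is left to the caller, and the validator follows the well-formed UTF-8 byte ranges from the Unicode Standard.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical result type for this sketch.
enum class GuessedEncoding { Utf8, Utf16BE, Utf16LE, Legacy8Bit };

// Strict UTF-8 check: rejects stray continuation bytes, truncated sequences,
// overlong forms, encoded surrogates and code points above U+10FFFF.
bool IsValidUtf8(const unsigned char* data, std::size_t size)
{
    assert(data != nullptr || size == 0);
    std::size_t i = 0;
    while (i < size) {
        const unsigned char b = data[i];
        if (b <= 0x7F) { ++i; continue; }          // plain ASCII
        std::size_t extra = 0;                     // continuation bytes expected
        unsigned char lo = 0x80, hi = 0xBF;        // allowed range of the first one
        if (b >= 0xC2 && b <= 0xDF)      { extra = 1; }
        else if (b == 0xE0)              { extra = 2; lo = 0xA0; }  // no overlongs
        else if (b == 0xED)              { extra = 2; hi = 0x9F; }  // no surrogates
        else if (b >= 0xE1 && b <= 0xEF) { extra = 2; }
        else if (b == 0xF0)              { extra = 3; lo = 0x90; }  // no overlongs
        else if (b >= 0xF1 && b <= 0xF3) { extra = 3; }
        else if (b == 0xF4)              { extra = 3; hi = 0x8F; }  // <= U+10FFFF
        else return false;               // 80..C1 and F5..FF never start a sequence
        if (size - i <= extra) return false;       // sequence is cut off
        if (data[i + 1] < lo || data[i + 1] > hi) return false;
        for (std::size_t k = 2; k <= extra; ++k)
            if ((data[i + k] & 0xC0) != 0x80) return false;
        i += extra + 1;
    }
    return true;
}

// BOM first, then "does it decode as UTF-8?", then the 8-bit fallback.
GuessedEncoding GuessEncoding(const unsigned char* data, std::size_t size)
{
    assert(data != nullptr || size == 0);
    if (size >= 3 && data[0] == 0xEF && data[1] == 0xBB && data[2] == 0xBF)
        return GuessedEncoding::Utf8;
    if (size >= 2 && data[0] == 0xFE && data[1] == 0xFF)
        return GuessedEncoding::Utf16BE;
    if (size >= 2 && data[0] == 0xFF && data[1] == 0xFE)
        return GuessedEncoding::Utf16LE;
    if (IsValidUtf8(data, size))
        return GuessedEncoding::Utf8;
    return GuessedEncoding::Legacy8Bit;  // e.g. cp-1252 on Windows
}
```

If only a prefix of the file is tested (say the first 4096 bytes), the prefix may end in the middle of a multi-byte sequence, so a caller would probably want to tolerate a truncated sequence at the very end of the sample. Note also that pure ASCII input is reported as UTF-8 here, which is harmless because ASCII is a subset of both UTF-8 and cp-1252.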

You may add a step before the last one. If there are many 00 bytes, the file may be in a UTF-16 or UTF-32 form. Unicode requires that you know which form is used (e.g. from a side channel); otherwise the file should have a BOM. But you can guess the form (UTF-16LE, UTF-16BE, UTF-32LE, UTF-32BE) from the positions of the 00 bytes in the file: newlines and some ASCII characters belong to the Common script, so they are used with many scripts, and you should therefore see many 00 bytes.
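One way to guess the form from the positions of the 00 bytes is sketched below. The function name `GuessWideForm`, the enum `WideForm` and all thresholds (three quarters, one tenth, a factor of ten) are assumptions chosen for illustration; they are not taken from any standard and would need tuning on real files.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical result type for this sketch.
enum class WideForm { None, Utf16LE, Utf16BE, Utf32LE, Utf32BE };

WideForm GuessWideForm(const unsigned char* data, std::size_t size)
{
    assert(data != nullptr || size == 0);

    const std::size_t groups = size / 4;   // whole 4-byte groups examined
    if (groups == 0) return WideForm::None;

    std::array<std::size_t, 4> zeros{};    // zero bytes seen at each offset % 4
    for (std::size_t i = 0; i < groups * 4; ++i)
        if (data[i] == 0x00) ++zeros[i % 4];

    // UTF-32: the highest byte of every code point is 00, and the next one is
    // 00 for everything in the Basic Multilingual Plane.
    if (zeros[3] == groups && zeros[2] * 4 > groups * 3) return WideForm::Utf32LE;
    if (zeros[0] == groups && zeros[1] * 4 > groups * 3) return WideForm::Utf32BE;

    // UTF-16: ASCII and Latin-1 characters have a 00 high byte, which sits at
    // odd offsets for little endian and at even offsets for big endian.
    const std::size_t even = zeros[0] + zeros[2];
    const std::size_t odd = zeros[1] + zeros[3];
    const std::size_t units = groups * 2;  // 16-bit units examined
    // Assumed thresholds: at least 10% of the units have a 00 on one side,
    // and that side has at least ten times as many 00 bytes as the other.
    if (odd * 10 >= units && odd >= even * 10) return WideForm::Utf16LE;
    if (even * 10 >= units && even >= odd * 10) return WideForm::Utf16BE;
    return WideForm::None;
}
```

This is only a guess: a binary file full of 00 bytes would also match, and UTF-16 text with very few ASCII characters may fall below the thresholds, which is why the step belongs after the BOM and UTF-8 checks rather than before them.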

"Now Windows 10 does not require unicode files to have the BOM header."

Windows never had this requirement. Every program can read text files however it wants to.

Maybe interesting: a BOM may not be desirable for UTF-8 because it breaks ASCII compatibility.

"This does break the existing code that checks the header to determine Unicode in files."

This is a misunderstanding. Other code has likely had Unicode support for longer than Windows Notepad.

"How can I now tell in C++ if a file is in unicode?"

Typically you would check for the presence of a BOM and, of course, use that information if it is there.
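A BOM check that also covers UTF-32 and reports how many bytes to skip might look like the sketch below. The names `Bom`, `BomInfo` and `DetectBom` are hypothetical; the byte signatures themselves are the standard ones.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical types for this sketch.
enum class Bom { None, Utf8, Utf16LE, Utf16BE, Utf32LE, Utf32BE };

struct BomInfo {
    Bom kind;
    std::size_t length;  // number of BOM bytes to skip before the text starts
};

BomInfo DetectBom(const unsigned char* data, std::size_t size)
{
    assert(data != nullptr || size == 0);
    // UTF-32 is tested before UTF-16 because FF FE 00 00 also begins with
    // the UTF-16 LE signature FF FE.
    if (size >= 4 && data[0] == 0xFF && data[1] == 0xFE &&
        data[2] == 0x00 && data[3] == 0x00)
        return {Bom::Utf32LE, 4};
    if (size >= 4 && data[0] == 0x00 && data[1] == 0x00 &&
        data[2] == 0xFE && data[3] == 0xFF)
        return {Bom::Utf32BE, 4};
    if (size >= 3 && data[0] == 0xEF && data[1] == 0xBB && data[2] == 0xBF)
        return {Bom::Utf8, 3};
    if (size >= 2 && data[0] == 0xFE && data[1] == 0xFF)
        return {Bom::Utf16BE, 2};
    if (size >= 2 && data[0] == 0xFF && data[1] == 0xFE)
        return {Bom::Utf16LE, 2};
    return {Bom::None, 0};  // no BOM: fall through to content-based guessing
}
```

The sequence FF FE 00 00 is strictly ambiguous (it could be a UTF-16 LE BOM followed by U+0000), but a NUL as the first character of a text file is unusual enough that treating it as UTF-32 LE is a common choice.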

Next you can try to read (the beginning of) the file with all possible encodings. The ones that throw an exception are obviously not suitable.
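As an illustration of ruling a candidate out, here is a small well-formedness check for UTF-16 that returns false where a strict decoder would report an error. The function name `IsWellFormedUtf16` is hypothetical, and the check only covers the byte count and surrogate pairing.

```cpp
#include <cassert>
#include <cstddef>

// Returns true if the bytes are well-formed UTF-16 in the given byte order:
// an even byte count and every surrogate correctly paired.
bool IsWellFormedUtf16(const unsigned char* data, std::size_t size, bool bigEndian)
{
    assert(data != nullptr || size == 0);
    if (size % 2 != 0) return false;       // half a code unit left over

    bool expectLow = false;                // true right after a high surrogate
    for (std::size_t i = 0; i < size; i += 2) {
        const unsigned hi = bigEndian ? data[i] : data[i + 1];
        const unsigned lo = bigEndian ? data[i + 1] : data[i];
        const unsigned unit = (hi << 8) | lo;
        const bool isHigh = unit >= 0xD800 && unit <= 0xDBFF;
        const bool isLow = unit >= 0xDC00 && unit <= 0xDFFF;
        if (expectLow != isLow) return false;  // lone or missing low surrogate
        expectLow = isHigh;
    }
    return !expectLow;                     // must not end inside a pair
}
```

Many byte sequences pass this check in both byte orders, so it removes only some candidates; the remaining ones still have to be ranked by a heuristic.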

From the remaining encodings, you could use a heuristic to determine the encoding.

And if that still turns out to be the wrong choice, give the user an option to change the encoding manually. That's how it is done in many editors, such as Notepad++.
