MSVC14 treats the u8 prefix differently depending on whether the source is UTF-8 or UTF-8 with BOM

I was experimenting with UTF-8 and Qt and ran into a weird issue, so I investigated. I wrote a simple program that prints the bytes of const char[] literals:

#include <cstdio>

const char* koshka = "кошка";
const char* utf8_koshka = u8"кошка";

void printhex(const char* str)
{
    for (; *str; ++str)
    {
        printf("%02X ", *str & 0xFF);
    }
    puts("");
}

int main(int argc, char *argv[])
{
    printhex(koshka);
    printhex(utf8_koshka);

    return 0;
}

If we save the file as UTF-8 with BOM and then run it from Visual Studio 2015, this is printed:

3F 3F 3F 3F 3F
D0 BA D0 BE D1 88 D0 BA D0 B0

While I don't really understand where the first string came from, the second is exactly what it should be, according to this UTF-8 encoding table.

If the exact same code is saved as UTF-8 without BOM, this is the output:

D0 BA D0 BE D1 88 D0 BA D0 B0 
C3 90 C2 BA C3 90 C2 BE C3 91 CB 86 C3 90 C2 BA C3 90 C2 B0

So while omitting the BOM causes the unprefixed const char[] literal to be stored in the binary as UTF-8, it breaks the u8 prefix for some reason.

If, however, we force the execution charset using #pragma execution_character_set("utf-8"), both strings are printed as D0 BA D0 BE D1 88 D0 BA D0 B0 in both cases (UTF-8 with and without BOM).

I've used Notepad++ to convert between the encodings.

What is going on?


EDIT:

Alan's answer explains the cause of this behavior, but I'd like to add a word of warning. I ran into this issue while using Qt Creator to develop a Qt 5.5.1 application. In 5.5.1, the QString(const char*) constructor assumes the given string is encoded as UTF-8, and so ends up calling QString::fromUtf8 to construct the object. However, Qt Creator (by default) saves every file as UTF-8 without BOM; this causes MSVC to misinterpret the source input as MBCS, which is exactly what happened in this case. So under the default settings, the following will work:

QMessageBox::information(0, "test", "кошка");

and this will fail (mojibake):

QMessageBox::information(0, "test", u8"кошка");

A solution would be to enable the BOM in Tools -> Options -> Text Editor. Note that this only applies to MSVC 2015 (or actually 14.0); older versions have little or no C++11 support, and u8 simply doesn't exist there, so if you're working with Qt on an older version, your best bet is to rely on the compiler getting confused by the lack of the BOM.

The compiler doesn't know what the encoding of the file is. It attempts to guess by looking at a prefix of the input. If it sees a UTF-8 encoded BOM, it assumes it is dealing with UTF-8. In the absence of that, and of any obvious UTF-16 characters, it defaults to something else. (ISO Latin 1? Whatever the common local MBCS is?)

Without the BOM, the compiler fails to determine that your input is UTF-8 encoded and so assumes it isn't.

It then sees each byte of the UTF-8 encoding as a single character; for the plain literal those bytes are copied across verbatim, and for the u8 string each one is encoded as UTF-8 again, giving the double encoding you see.

The only solution seems to be to force the BOM; alternatively, use UTF-16, which is what the Windows platform really prefers.

See also: Specification of source charset encoding in MSVC++, like gcc "-finput-charset=CharSet".
