Non-ASCII wchar_t literals under LLVM

Question

I've migrated an Xcode iOS project from Xcode 3.2.6 to 4.2. Now I'm getting warnings when I try to initialize a wchar_t with a literal containing a non-ASCII character:

wchar_t c1;
if(c1 <= L'я') //That's Cyrillic "ya"

The messages are:

MyFile.cpp:148:28: warning: character unicode escape sequence too long for its type [2]
MyFile.cpp:148:28: warning: extraneous characters in wide character constant ignored [2]

And the literal does not work as expected: the comparison misfires.

I'm compiling with -fshort-wchar, and the source file is in UTF-8. The Xcode editor displays the file fine. The code compiled and worked on GCC (several flavors, including Xcode 3) and on MSVC. Is there a way to make the LLVM compiler recognize those literals? If not, can I go back to GCC in Xcode 4?

EDIT: Xcode 4.2 on Snow Leopard. Long story why.

EDIT2: Confirmed on a brand new project. The file extension does not matter: .m files show the same behavior. -fshort-wchar does not affect it either. Looks like I have to go back to GCC until I can upgrade to a version of Xcode where this is fixed.

Answer 1

Not an answer, but hopefully helpful information: I could not reproduce the problem with clang 4.0 (Xcode 4.5.1):

$ uname -a
Darwin air 12.2.0 Darwin Kernel Version 12.2.0: Sat Aug 25 00:48:52 PDT 2012; root:xnu-2050.18.24~1/RELEASE_X86_64 x86_64
$ env | grep LANG
LANG=en_US.UTF-8
$ clang -v
Apple clang version 4.0 (tags/Apple/clang-421.0.60) (based on LLVM 3.1svn)
Target: x86_64-apple-darwin12.2.0
Thread model: posix
$ cat test.c
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    wchar_t c1 = 0;
    printf("sizeof(c1) == %lu\n", sizeof(c1));
    printf("sizeof(L'Я') == %lu\n", sizeof(L'Я'));
    if (c1 < L'Я') {
        printf("Я люблю часы Заря!\n");
    } else {
        printf("Что за....?\n");
    }
    return EXIT_SUCCESS;
}

$ clang -Wall -pedantic ./test.c 
$ ./a.out 
sizeof(c1) == 4
sizeof(L'Я') == 4
Я люблю часы Заря!
$ clang -Wall -pedantic ./test.c -fshort-wchar
$ ./a.out 
sizeof(c1) == 2
sizeof(L'Я') == 2
Я люблю часы Заря!
$

The same behavior is observed with clang++ (where wchar_t is a built-in type).

Answer 2

I don't have an answer to your specific question, but I wanted to point out that llvm-gcc has been permanently discontinued. In my experience dealing with differences between Clang, llvm-gcc, and gcc, Clang is often correct with regard to the C++ specification, even when its behavior is surprising.

Answer 3

If the source is in fact UTF-8, then this isn't correct behavior. However, I can't reproduce the behavior in the most recent version of Xcode.

MyFile.cpp:148:28: warning: character unicode escape sequence too long for its type [2]

This error should be referring to a 'Universal Character Name' (UCN), which looks like "\U001012AB" or "\u0403". It indicates that the value represented by the escape sequence is larger than the enclosing literal type is capable of holding. For example, if the codepoint value requires more than 16 bits, then a 16-bit wchar_t will not be able to hold the value.

MyFile.cpp:148:28: warning: extraneous characters in wide character constant ignored [2]

This indicates that the compiler thinks there's more than one codepoint represented inside a wide character literal, e.g. L'ab'. The behavior is implementation-defined, and both clang and gcc simply use the last codepoint value.

The code you show shouldn't trigger either of these, at least in clang. The first, because it applies only to UCNs, let alone the fact that 'я' fits easily within a single 16-bit wchar_t. The second, because the source code encoding is always taken to be UTF-8, so clang will see the UTF-8 multibyte representation of 'я' as a single codepoint.

You might recheck and ensure that the source actually is UTF-8. Then you should check that you're using an up-to-date version of Xcode. You can also try switching the compiler in your project settings > Compiler for C/C++/Objective-C.

3 answers

Answer 1: 2
Answer 2: 1, 2012-10-26 03:36:57
Answer 3: 1, accepted, 2012-10-26 04:27:05
