Multi-byte UTF-8 in arrays in C++

Question

I have been having trouble working with 3-byte UTF-8 characters in arrays. When they are in char arrays I get "multi-character character constant" and "implicit constant conversion" warnings, but when I use wchar_t arrays, wcout prints nothing at all. Because of the nature of the project, it must be an array and not a string. Below is an example of what I've been trying to do.

#include <iostream>
#include <string>
using namespace std;
int main()
{
    wchar_t testing[40];
    testing[0] = L'\u0B95';
    testing[1] = L'\u0BA3';
    testing[2] = L'\u0B82';
    testing[3] = L'\0';
    wcout << testing[0] << endl;
    return 0;
}

Any suggestions? I'm working on OS X.

Answer 1

Since '\u0B95' requires 3 bytes in UTF-8, it is treated as a multicharacter literal. A multicharacter literal has type int and an implementation-defined value. (Actually, I don't think gcc is correct to do this.)

Putting the L prefix before the literal gives it type wchar_t and an implementation-defined value: it maps to a value in the execution wide-character set, which is an implementation-defined superset of the basic execution wide-character set.

The C++11 standard provides some more Unicode-aware types and literals. The additional types are char16_t and char32_t, whose values are the Unicode code points of the characters they represent (a single char16_t can only hold code points up to U+FFFF). They are analogous to UTF-16 and UTF-32 respectively.
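To make the point about code points concrete, here is a small compile-time sketch. The names ka16 and ka32 are made up for illustration; the checks only rely on the C++11 u and U literal prefixes.

```cpp
#include <type_traits>

// The u prefix yields a char16_t literal, the U prefix a char32_t literal.
static_assert(std::is_same<decltype(u'\u0B95'), char16_t>::value,
              "u'...' has type char16_t");
static_assert(std::is_same<decltype(U'\u0B95'), char32_t>::value,
              "U'...' has type char32_t");

// The stored value is the code point itself, not a UTF-8 byte sequence.
static_assert(u'\u0B95' == 0x0B95, "char16_t holds U+0B95 directly");
static_assert(U'\U0001F600' == 0x1F600, "char32_t holds any code point");

// Illustrative constants (hypothetical names, not from the question).
constexpr char16_t ka16 = u'\u0B95';
constexpr char32_t ka32 = U'\u0B95';
```

If this compiles, each element of a char16_t array holds one whole character from the basic multilingual plane, which is what the question's array needs.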

Since you need character literals to store characters from the basic multilingual plane, you'll need a char16_t literal. This can be written as, for example, u'\u0B95'. You can therefore write your code as follows, with no warnings or errors:

char16_t testing[40];
testing[0] = u'\u0B95';
testing[1] = u'\u0BA3';
testing[2] = u'\u0B82';
testing[3] = u'\0';

Unfortunately, the I/O library does not play nicely with these new types.
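One possible workaround, offered as a sketch rather than a definitive fix, is to convert the char16_t array to UTF-8 by hand and write the result through the ordinary narrow stream. The helper utf16_to_utf8 below is hypothetical (it is not a standard library function), and it assumes the terminal interprets output as UTF-8.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: converts a null-terminated UTF-16 array to a
// UTF-8 encoded std::string. Unpaired surrogates are not rejected;
// they are simply encoded as three bytes.
std::string utf16_to_utf8(const char16_t* s)
{
    std::string out;
    for (std::size_t i = 0; s[i] != u'\0'; ++i) {
        char32_t cp = s[i];
        // Combine a high surrogate with the low surrogate that follows it.
        if (cp >= 0xD800 && cp <= 0xDBFF &&
            s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF) {
            cp = 0x10000 + ((cp - 0xD800) << 10) + (s[i + 1] - 0xDC00);
            ++i;
        }
        if (cp < 0x80) {                      // 1 byte: ASCII
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {              // 2 bytes
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {            // 3 bytes: rest of the BMP
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {                              // 4 bytes: beyond the BMP
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

With the testing array from above, std::cout << utf16_to_utf8(testing) would then emit the three characters as nine UTF-8 bytes.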

If you do not truly need character literals as above, you can make use of the new UTF-8 string literals:

const char* testing = u8"\u0B95\u0BA3\u0B82";

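This will encode the characters as UTF-8, three bytes each. Note that since C++20 a u8 literal has type const char8_t[], so the line above compiles as written only in C++11 through C++17.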
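To see what the u8 literal actually stores, the sketch below dumps its bytes as hex. The names hex_bytes, utf8_length and tamil_bytes are made up for this illustration. It uses auto and a reinterpret_cast so that it should compile both before and after the C++20 char8_t change.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

// Hypothetical helper: formats n bytes as space-separated uppercase hex.
std::string hex_bytes(const unsigned char* data, std::size_t n)
{
    std::string out;
    char buf[4];
    for (std::size_t i = 0; i < n; ++i) {
        std::snprintf(buf, sizeof buf, "%02X", static_cast<unsigned>(data[i]));
        if (i != 0) out += ' ';
        out += buf;
    }
    return out;
}

// sizeof counts the terminating null, so subtract one to get the payload.
constexpr std::size_t utf8_length = sizeof(u8"\u0B95\u0BA3\u0B82") - 1;

// Returns the hex dump of the three Tamil characters from the question.
std::string tamil_bytes()
{
    // auto deduces const char* (C++11..17) or const char8_t* (C++20).
    const auto* p = u8"\u0B95\u0BA3\u0B82";
    return hex_bytes(reinterpret_cast<const unsigned char*>(p), utf8_length);
}
```

Each of the three characters occupies three bytes, which is why a single one cannot fit in a plain char array element.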

Answer 1
4 2012-11-24 23:41:36
