C++ on Windows: converting a decimal code point to a UTF-8 character

Question

I've been using the function below to convert the decimal representation of a Unicode character to the UTF-8 character itself in C++. The function works well on Linux/Unix systems, but it keeps returning the wrong characters on Windows.

void GetUnicodeChar(unsigned int code, char chars[5]) {
    if (code <= 0x7F) {
        chars[0] = (code & 0x7F); chars[1] = '\0';
    } else if (code <= 0x7FF) {
        // one continuation byte
        chars[1] = 0x80 | (code & 0x3F); code = (code >> 6);
        chars[0] = 0xC0 | (code & 0x1F); chars[2] = '\0';
    } else if (code <= 0xFFFF) {
        // two continuation bytes
        chars[2] = 0x80 | (code & 0x3F); code = (code >> 6);
        chars[1] = 0x80 | (code & 0x3F); code = (code >> 6);
        chars[0] = 0xE0 | (code & 0xF); chars[3] = '\0';
    } else if (code <= 0x10FFFF) {
        // three continuation bytes
        chars[3] = 0x80 | (code & 0x3F); code = (code >> 6);
        chars[2] = 0x80 | (code & 0x3F); code = (code >> 6);
        chars[1] = 0x80 | (code & 0x3F); code = (code >> 6);
        chars[0] = 0xF0 | (code & 0x7); chars[4] = '\0';
    } else {
        // Unicode replacement character U+FFFD (UTF-8 bytes: EF BF BD)
        chars[0] = 0xEF; chars[1] = 0xBF; chars[2] = 0xBD;
        chars[3] = '\0';
    }
}

Can anyone provide an alternative function, or a fix for the function I'm currently using, that will work on Windows?

--UPDATE--

INPUT: 225
OUTPUT ON OSX: á
OUTPUT ON WINDOWS: ├í

Answer 1

You don't show your code for printing, but presumably you're doing something like this:

char s[5];
GetUnicodeChar(225, s);
std::cout << s << '\n';

You get correct output on OS X and garbled output on Windows because OS X uses UTF-8 as its default encoding, while the Windows console uses a legacy code page. So when you output UTF-8 on OS X, OS X assumes (correctly) that it's UTF-8 and displays it as such. When you output UTF-8 on Windows, Windows assumes (incorrectly) that it's some other encoding.

You can simulate the problem on OS X by running the iconv program with the following command in Terminal.app:

iconv -f cp437 -t utf8 <<< "á"

This takes the UTF-8 string, reinterprets its bytes as text encoded in code page 437, and converts the result to UTF-8 for display. The output on OS X is ├í.
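To make the byte-level picture concrete, here is a minimal sketch (not part of the original answer) that encodes 225 with the same bit layout as the two-byte branch of the question's function and prints the resulting bytes in hex. The names EncodeTwoByte and HexDump are hypothetical helpers introduced only for this illustration.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper: the two-byte UTF-8 branch only (valid for 0x80..0x7FF),
// using the same bit layout as the corresponding branch in the question.
std::string EncodeTwoByte(unsigned int code) {
    std::string out;
    out += static_cast<char>(0xC0 | ((code >> 6) & 0x1F));  // lead byte: 110xxxxx
    out += static_cast<char>(0x80 | (code & 0x3F));         // continuation byte: 10xxxxxx
    return out;
}

// Hypothetical helper: formats each byte as two uppercase hex digits,
// separated by single spaces.
std::string HexDump(const std::string& bytes) {
    std::string out;
    for (unsigned char b : bytes) {
        char buf[3];
        std::snprintf(buf, sizeof buf, "%02X", static_cast<unsigned int>(b));
        if (!out.empty()) out += ' ';
        out += buf;
    }
    return out;
}
```

HexDump(EncodeTwoByte(225)) gives "C3 A1". A UTF-8 terminal reads those two bytes as the single character á, whereas in code page 437 byte 0xC3 is ├ and byte 0xA1 is í, which matches the Windows output in the question.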

For testing small things, you can do the following to display UTF-8 data properly on Windows:

#include <windows.h>  // declares SetConsoleOutputCP and CP_UTF8

#include <cstdio>

char s[5];
GetUnicodeChar(225, s);

SetConsoleOutputCP(CP_UTF8);
std::printf("%s\n", s);

Also, parts of Windows' implementation of the standard library don't support output of UTF-8, so even after you change the output encoding, code like std::cout << s still won't work.

On a side note, taking an array as a parameter like this:

void GetUnicodeChar(unsigned int code, char chars[5]) {

is a bad idea. The parameter is really just a char*, so the compiler will not catch mistakes such as:

char *s; GetUnicodeChar(225, s);
char s[1]; GetUnicodeChar(225, s);

You can avoid these specific problems by changing the function to take a reference to an array instead:

void GetUnicodeChar(unsigned int code, char (&chars)[5]) {
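As a small illustration of what the reference-to-array signature buys you, consider the sketch below. PutAscii is a hypothetical stand-in, not the real encoder: its body only handles ASCII so the example stays short, and the point is the parameter type rather than the encoding.

```cpp
// Hypothetical stand-in with the reference-to-array signature.
// The body handles ASCII only; it is not a full UTF-8 encoder.
void PutAscii(unsigned int code, char (&chars)[5]) {
    chars[0] = static_cast<char>(code & 0x7F);
    chars[1] = '\0';
}

void Demo() {
    char ok[5];
    PutAscii('A', ok);        // compiles: the argument is exactly char[5]

    // char* p = nullptr;
    // PutAscii('A', p);      // does not compile: a char* is not a char[5]

    // char tiny[1];
    // PutAscii('A', tiny);   // does not compile: char[1] is not char[5]
}
```

Uncommenting either of the two rejected calls should produce a compile error, which is exactly the check the plain char chars[5] parameter cannot provide.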

However, in general I'd recommend avoiding raw arrays altogether. You can use std::array if you really want an array. You can use std::string if you want text, which IMO is a good choice here:

std::string GetUnicodeChar(unsigned int code);
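One possible body for that signature is sketched below. It is an illustration rather than the answer's own code: it follows the same bit layout as the function in the question and, like that function, makes no attempt to reject surrogate code points (0xD800 to 0xDFFF).

```cpp
#include <string>

// Sketch of a std::string-returning encoder. Same branches as the question's
// function; out-of-range input yields U+FFFD, the replacement character.
std::string GetUnicodeChar(unsigned int code) {
    std::string out;
    if (code <= 0x7F) {
        out += static_cast<char>(code);                          // 0xxxxxxx
    } else if (code <= 0x7FF) {
        out += static_cast<char>(0xC0 | (code >> 6));            // 110xxxxx
        out += static_cast<char>(0x80 | (code & 0x3F));          // 10xxxxxx
    } else if (code <= 0xFFFF) {
        out += static_cast<char>(0xE0 | (code >> 12));           // 1110xxxx
        out += static_cast<char>(0x80 | ((code >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (code & 0x3F));
    } else if (code <= 0x10FFFF) {
        out += static_cast<char>(0xF0 | (code >> 18));           // 11110xxx
        out += static_cast<char>(0x80 | ((code >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((code >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (code & 0x3F));
    } else {
        out = "\xEF\xBF\xBD";                                    // U+FFFD
    }
    return out;
}
```

Because the result owns its storage and knows its own length, callers no longer need to size a buffer correctly, and a call such as GetUnicodeChar(225) can be passed straight to whatever output routine is in use.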

Answer 2

The function is correct. The output presumably isn't, which means the bug is in the routine that prints it, but you don't show that. I'll bet that you're assuming that Windows can print UTF-8.

Answer 1: score 5, accepted, 2014-05-05 17:57:47

Answer 2: score 2, 2014-05-05 16:17:26
