
Multi-platform Unicode handling based on char* in C without using 3rd party libraries?

The following are bare-minimum examples (I know that e.g. UNICODE/_UNICODE should be defined) that I've found to work:

Linux:

#include <stdio.h>

int main() {
  char* str = "Rölf";
  printf("%s\n", str);
}

Windows:

#include <stdio.h>
#include <locale.h>

int main() {
  setlocale(LC_ALL, "");
  wchar_t* str = L"Rölf";
  wprintf(L"%s\n", str);
}

Now, I've read that one way of going about it is to basically "just use UTF-8/char everywhere and worry about platform-specific conversion when you do API calls".

And that would be great: have users provide char* as input for my library and "simply" convert that. So I've tried the following snippet based on this example (I've also seen it in variations elsewhere). If this actually worked, it would be amazing. But it doesn't:

  char* str = u8"Rölf";
  int len = mbstowcs(NULL, str, 0) + 1;
  wchar_t wstr[len];
  mbstowcs(wstr, str, len);
  wprintf(L"%s\n", wstr);

I've also stumbled across discussions about console fonts and whatnot being the cause of faulty rendering, so to demonstrate that this is not a console issue: the following doesn't work either (well, the L"" literal does; the converted u8 literal doesn't):

  MessageBoxW(NULL, wstr, L"Rölf", MB_OK);


Am I misunderstanding the conversion process? Is there a way to make this work? (Without using e.g. ICU)

The mbstowcs function converts from a string encoded in the current locale's encoding to wchar_t[], not from UTF-8 (unless that encoding is UTF-8). On Windows 10 builds from the April 2018 beta onward, you actually can set Windows to use UTF-8 as the encoding for plain char[] strings, either as a global setting or presumably by calling _setmbcp(65001). Older versions of Windows explicitly forbid this, however, for dubious historical reasons.
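Because mbstowcs fails on bytes that are not valid in the locale's encoding, its return value needs checking before it is used as a length; in the question's snippet a failure would turn `mbstowcs(NULL, str, 0) + 1` into 0. A minimal sketch of a checked conversion follows. The helper name `locale_to_wide` is hypothetical, and calling mbstowcs with a NULL destination to obtain the length is a POSIX/MSVC behavior rather than something ISO C guarantees.

```c
#include <assert.h>
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical helper: converts a string in the CURRENT LOCALE's multibyte
   encoding (not necessarily UTF-8) to a newly allocated wide string.
   Returns NULL if the input is invalid in that encoding or malloc fails.
   The caller must free() the result. */
wchar_t *locale_to_wide(const char *s) {
  assert(s != NULL);

  /* Length in wide characters, excluding the terminator. */
  size_t n = mbstowcs(NULL, s, 0);
  if (n == (size_t)-1) return NULL;  /* invalid sequence for this locale */

  wchar_t *w = malloc((n + 1) * sizeof *w);
  if (w == NULL) return NULL;

  if (mbstowcs(w, s, n + 1) == (size_t)-1) {
    free(w);
    return NULL;
  }
  return w;
}
```

Whether a UTF-8 string such as "Rölf" converts successfully here depends entirely on the locale in effect, which is the point of the paragraph above.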

Anyway, your second version of the code, which you called "Windows", should work on arbitrary systems if not for a bug in MSVC's wprintf that you worked around: it has the meanings of %ls and %s backwards for the wide stdio functions. In standard C, you need %ls to format a wchar_t[] string. But there's actually no reason to use wprintf there at all, and in fact wprintf is highly problematic because you can't mix it with byte-oriented stdio on the same stream (doing so invokes undefined behavior). So better would be:

#include <stdio.h>
#include <locale.h>
#include <wchar.h>  // for wchar_t

int main() {
  setlocale(LC_ALL, "");
  wchar_t* str = L"Rölf";
  printf("%ls\n", str);
}

and this version should work correctly on both Windows and standards-conforming C implementations, since for the byte-oriented printf functions MSVC doesn't have the meanings of %s and %ls reversed.

If you really want to, you can also use a variant of your third version of the code, but you can't use mbstowcs to convert from UTF-8 to wchar_t. Instead you need to do one of the following:

  1. Assume wchar_t is Unicode-encoded, and convert from UTF-8 to Unicode codepoints with your own (or a third-party library's) UTF-8 decoder. But this is a bad assumption, because MSVC is also non-conforming in that it uses UTF-16 for wchar_t (C explicitly forbids "multi-wchar_t characters" because the mb/wc APIs are inherently incompatible with them), not Unicode codepoint values (equivalent to UTF-32).

  2. Convert from UTF-8 to char32_t (UTF-32) with your own (or a third-party library's) UTF-8 decoder, then use c32rtomb to convert to the locale's multibyte encoding, and mbstowcs to get from there to wchar_t[].

  3. Use iconv (standard on POSIX systems; available as a third-party library on Windows) to convert directly from UTF-8 to wchar_t.
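Options 1 and 2 both start with a hand-written UTF-8 decoder. The following is only a sketch of what such a decoder could look like; the function names are hypothetical, and uint32_t stands in for char32_t so that the code does not depend on <uchar.h>. It rejects truncated, overlong, surrogate, and out-of-range sequences.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decodes one code point from the NUL-terminated UTF-8 string s into *cp.
   Returns the number of bytes consumed (1 to 4), or 0 if the sequence is
   malformed (truncated, overlong, a surrogate, or above U+10FFFF). */
size_t utf8_decode_one(const char *s, uint32_t *cp) {
  assert(s != NULL && cp != NULL);
  const unsigned char *p = (const unsigned char *)s;
  size_t len;
  uint32_t min, c;

  if (p[0] < 0x80)                { *cp = p[0]; return 1; }
  else if ((p[0] & 0xE0) == 0xC0) { len = 2; min = 0x80;    c = p[0] & 0x1F; }
  else if ((p[0] & 0xF0) == 0xE0) { len = 3; min = 0x800;   c = p[0] & 0x0F; }
  else if ((p[0] & 0xF8) == 0xF0) { len = 4; min = 0x10000; c = p[0] & 0x07; }
  else return 0;  /* stray continuation byte or invalid lead byte */

  for (size_t i = 1; i < len; i++) {
    /* A premature NUL fails this check too, so we never read past it. */
    if ((p[i] & 0xC0) != 0x80) return 0;
    c = (c << 6) | (p[i] & 0x3F);
  }
  if (c < min || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF)) return 0;
  *cp = c;
  return len;
}

/* Decodes a whole string into out, which has room for cap code points.
   Returns the number of code points written, or (size_t)-1 on malformed
   input or insufficient space. No terminator is appended. */
size_t utf8_to_utf32(const char *s, uint32_t *out, size_t cap) {
  assert(s != NULL && out != NULL);
  size_t n = 0;
  while (*s != '\0') {
    uint32_t cp;
    size_t used = utf8_decode_one(s, &cp);
    if (used == 0 || n == cap) return (size_t)-1;
    out[n++] = cp;
    s += used;
  }
  return n;
}
```

On a platform where wchar_t holds UTF-32, each decoded value can be stored in a wchar_t directly. Where wchar_t is UTF-16, code points above U+FFFF still have to be split into surrogate pairs, which this sketch does not do.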


UTF-8 option for Windows 10, version 1803+

Thanks to Barmak Shemirani making me aware of MultiByteToWideChar, I've found a solution to this that even conforms to C99. (It works on Windows 7, by the way.)

Note that setlocale() is only necessary for console output to render correctly. I left it out to highlight that it doesn't seem to be needed for GUI-related API calls.

#define UNICODE
#define _UNICODE

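#include <stdio.h>
#include <stdlib.h>  // malloc, free
#include <windows.h>
//#include <locale.h>

// Converts a NUL-terminated UTF-8 string to a newly allocated UTF-16 string.
// Returns NULL on failure; the caller must free() the result.
wchar_t* toWide(const char* str) {
  // With a length of -1, the returned count includes the terminating NUL.
  int wchars_num = MultiByteToWideChar(CP_UTF8, 0, str, -1, NULL, 0);
  if (wchars_num <= 0) return NULL;

  wchar_t* wstr = (wchar_t*)malloc(sizeof(wchar_t) * wchars_num);
  if (wstr == NULL) return NULL;
  MultiByteToWideChar(CP_UTF8, 0, str, -1, wstr, wchars_num);

  return wstr;
}

int main() {
  // For output in console to render correctly - as far as the font allows anyway...
  //setlocale(LC_ALL, "");

  // PLATFORM-AGNOSTIC DATA STRUCTURE WITH UTF-8 TEXT
  // (Usually not directly next to the platform-specific API calls...)
  // Assumes the compiler stores this literal as UTF-8 bytes.
  const char* str = "Rölf";

  // PLATFORM-SPECIFIC TEXT HANDLING
  wchar_t* wstr = toWide(str);
  if (wstr == NULL) return 1;
  printf("%ls\n", wstr);

  MessageBox(NULL, wstr, L"Rölf", MB_OK);
  free(wstr);
}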

The way I use it is that I declare a data structure to be filled by my users where all text is char* and assumed to be UTF-8. Then in my library, I use platform-specific UI APIs. And in the case of Windows, doing the above UTF-8 to UTF-16 conversion is obviously necessary.
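To illustrate that layout, here is a rough sketch of how such a library boundary might look. The names `ui_message`, `ui_show_message`, and `utf8_to_wide` are made up for this example and are not from the code above. The non-Windows branch simply prints to stdout, assuming a UTF-8 terminal, as a stand-in for whatever UI API that platform would really use.

```c
#include <assert.h>
#include <stdio.h>
#ifdef _WIN32
#include <stdlib.h>
#include <windows.h>
#endif

/* Hypothetical user-facing structure: all text is UTF-8 in plain char*. */
typedef struct {
  const char *title;
  const char *text;
} ui_message;

#ifdef _WIN32
/* UTF-8 to UTF-16; returns NULL on invalid input or allocation failure. */
static wchar_t *utf8_to_wide(const char *s) {
  int n = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS, s, -1, NULL, 0);
  if (n <= 0) return NULL;
  wchar_t *w = malloc((size_t)n * sizeof *w);
  if (w != NULL &&
      MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS, s, -1, w, n) == 0) {
    free(w);
    w = NULL;
  }
  return w;
}
#endif

/* Shows a message using the platform's API. Returns 0 on success, -1 on
   failure. Conversion happens only here, at the API boundary. */
int ui_show_message(const ui_message *msg) {
  assert(msg != NULL && msg->title != NULL && msg->text != NULL);
#ifdef _WIN32
  wchar_t *title = utf8_to_wide(msg->title);
  wchar_t *text = utf8_to_wide(msg->text);
  int rc = -1;
  if (title != NULL && text != NULL)
    rc = MessageBoxW(NULL, text, title, MB_OK) != 0 ? 0 : -1;
  free(title);  /* free(NULL) is a no-op */
  free(text);
  return rc;
#else
  /* Placeholder for a real UI call; assumes the terminal expects UTF-8. */
  return printf("%s: %s\n", msg->title, msg->text) < 0 ? -1 : 0;
#endif
}
```

A caller would fill in a `ui_message` with UTF-8 strings, for example `{"Greeting", "R\xC3\xB6lf"}`, and never touch wchar_t itself.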
