MinGW + GCC on Windows and UTF-8 characters

I'm having trouble with the GCC compiler and the Windows CMD console: UTF-8 characters are not displayed correctly. I have the following code:

#include <stdio.h>
#include <stdlib.h>

int main()
{
  char caractere;
  int inteiro;
  float Float;
  double Double;

  printf("Tipo de Dados\tNúmero de Bytes\tEndereço\n");
  printf("Caractere\t%d bytes \t em %d\n", sizeof(caractere), &caractere);
  printf("Inteiro\t%d bytes \t em %d\n", sizeof(inteiro), &inteiro);
  printf("Float\t%d bytes \t\t em %d\n", sizeof(Float), &Float);
  printf("Double\t%d bytes \t em %d\n", sizeof(Double), &Double);

  printf("Caractere: %d bytes \t em %p\n", sizeof(caractere), &caractere);
  printf("Inteiro: %d bytes \t em %p\n", sizeof(inteiro), &inteiro);
  printf("Float: %d bytes \t\t em %p\n", sizeof(Float), &Float);
  printf("Double: %d bytes \t em %p\n", sizeof(Double), &Double);

  return 0;
}

Then I run the following command:

gcc pointers01.c -o pointers

I don't get any compilation errors. But when I run the resulting file (.exe), the UTF-8 characters are not displayed correctly:

Tipo de Dados   Número de Bytes    Endereço
Caractere   1 bytes      em 2686751
Inteiro 4 bytes      em 2686744
Float   4 bytes          em 2686740
Double  8 bytes      em 2686728
Caractere: 1 bytes   em 0028FF1F
Inteiro: 4 bytes     em 0028FF18
Float: 4 bytes       em 0028FF14
Double: 8 bytes      em 0028FF08

How can I resolve this problem? Thank you.

Sadly, the Windows console has very limited and buggy support for UTF-8.

What can be done: set the code page to 65001 and use one of the fonts that support it, e.g. "Lucida Console". The code page can be set with the chcp command or, in C/C++, with the function SetConsoleOutputCP; the font is set with SetCurrentConsoleFontEx.

However, there are some major (and minor) problems. The minor ones first:

a) These settings are valid for one session only, i.e. if you run the program again later, you have to set them again. Making it the default is possible in theory but not recommended, because it will affect all console programs and introduce the problems below to them, even if they don't do anything with code pages and are not written to mitigate the problems.

b) If the console isn't opened by the program but you're starting the program from an existing console, the change will affect whatever runs after it until that console is closed. So you have to change it back to the default value before your own program exits.

c) Some functions usable for console input/output won't work properly with CP65001 (that's the most severe thing).

Unlike with the whole UTF-16 part of Windows, the implementation partially treats UTF-8 like any 1-byte charset and does some strange things that just happen to conform to the standard for 1-byte charsets, but are implemented differently.

As an example, fread should return the number of bytes read (if called with size 1), but in Microsoft's implementation it returns the number of characters (UTF-16 is an exception, but not UTF-8). With any normal code page this works because 1 char = 1 byte, but not with UTF-8: wrong return value => wrong data processed.

Another example: fflush can hang (at least it is reported to; I didn't check), and so on. And it doesn't only affect standard C functions, but direct WinAPI calls too.

d) As a result of c), batch files containing UTF-8 characters outside the normal ASCII range won't work properly, at least in some Windows versions (I didn't check each one, but it's very likely that Win10 still has this bug; MS shows no intention of fixing it anytime soon).

Some more reading for c) and d): https://social.msdn.microsoft.com/Forums/vstudio/en-US/e4b91f49-6f60-4ffe-887a-e18e39250905/possible-bugs-in-writefile-and-crt-unicode-issues?forum=vcgeneral

I usually use Sublime Text to save the source file as DOS (CP437), and it works (at least for small programs).
