
char vs wchar_t


I'm trying to print out a wchar_t* string. The code is below:

#include <stdio.h>
#include <string.h>
#include <wchar.h>

char *ascii_ = "中日友好";  //line-1
wchar_t *wchar_ = L"中日友好";  //line-2

int main()
{
    printf("ascii_: %s\n", ascii_);  //line-3
    wprintf(L"wchar_: %s\n", wchar_);  //line-4
    return 0;
}

//Output
ascii_: 中日友好

Question:

  1. Apparently I should not assign CJK characters to a char* pointer in line-1, but I did, and the output of line-3 is correct. Why? How can printf() in line-3 print non-ASCII characters? Does it know the encoding somehow?

  2. I assume the code in line-2 and line-4 is correct, so why didn't I get any output from line-4?

First of all, it's usually not a good idea to use non-ASCII characters in source code. What's probably happening is that the Chinese characters are being encoded as UTF-8, which is compatible with ASCII, so printf() passes the bytes through unchanged.

Now, as for why the wprintf() isn't working: this has to do with stream orientation. Each stream is either byte-oriented (normal) or wide-oriented. The orientation is set the first time the stream is used (here it becomes byte-oriented because of the printf()), and once set it cannot be changed short of reopening the stream with freopen(). After that, wprintf() will not work because the orientation is wrong.

In other words, once you use printf() you need to keep using printf(). Similarly, if you start with wprintf(), you need to keep using wprintf().

You cannot intermix printf() and wprintf() (except on Windows).

EDIT:

To answer the question about why the wprintf() line doesn't work even by itself: it's probably because the code is being compiled so that the UTF-8 encoding of 中日友好 is stored into wchar_. However, wchar_t needs a 4-byte Unicode encoding (2 bytes on Windows).

So there are two options that I can think of:

  1. Don't bother with wchar_t, and just stick with multi-byte chars. This is the easy way, but it may break if the user's system is not set to the Chinese locale.
  2. Use wchar_t, but encode the Chinese characters using Unicode escape sequences. This will obviously make them unreadable in the source code, but it will work on any machine that can print Chinese characters, regardless of the locale.

Line 1 is not ASCII; it's whatever multibyte encoding your compiler uses at compile time. On modern systems that's probably UTF-8. printf() does not know the encoding. It just sends bytes to stdout, and as long as the encodings match, everything is fine.

One problem you should be aware of is that lines 3 and 4 together invoke undefined behavior. You cannot mix byte-oriented and wide-character I/O on the same FILE (stdout). After the first operation, the FILE has an "orientation" (either byte or wide), and after that any attempt to perform operations of the opposite orientation results in UB.

You are omitting one step and are therefore thinking about it the wrong way.

You have a C file on disk, containing bytes. You have an "ASCII" string and a wide string.

The ASCII string takes the bytes exactly as they are in line 1 and outputs them. This works as long as the encoding on the user's side is the same as the one on the programmer's side.

For the wide string, the given bytes are first decoded into Unicode code points, which are stored in the program; maybe this goes wrong on your side. On output, they are encoded again according to the encoding on the user's side. This ensures that the characters are emitted as they were intended, not as they were entered.

Either your compiler assumes the wrong encoding, or your output terminal is set up the wrong way.
