
Storing Unicode in char

I wrote a program to test I/O from a terminal:

#include <stdio.h>
int main()
{
    char *input[100];
    scanf("%s", input);
    printf("%s", input);
    return 0;
}

It works as it should with ASCII characters, but it also works with Unicode characters and emoji.

Why is this?

Your code works because the input and output streams use the same encoding, and you do not do anything with c (the buffer, named input in the question's code).

Basically, you type something, which is converted into a sequence of bytes and stored in c. You then send that same sequence of bytes back to stdout, where it is converted back to readable characters.

As long as the encoding and decoding processes are compatible, you will get the "expected" result.

Now, what happens if you try to use the standard C string functions? Suppose you typed "♠Hello" in your terminal. You will get the expected output, but:

strlen(c) -> 8
c[0] -> Some strange character
c[3] -> H

You see? Being able to store whatever you want in a char array does not mean you should. If you want to deal with extended character sets, use wchar_t instead.
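To make those numbers concrete, here is a minimal sketch, assuming the terminal delivered UTF-8 (the answer does not say which encoding was in use). The helper utf8_code_points and the constant spade_hello are hypothetical names introduced only for illustration; the helper also assumes its input is valid UTF-8.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: counts code points in a UTF-8 string by skipping
   continuation bytes (those of the form 10xxxxxx).
   Assumption: s holds valid UTF-8. */
size_t utf8_code_points(const char *s)
{
    size_t count = 0;
    for (; *s != '\0'; ++s) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            ++count;
    }
    return count;
}

/* "♠Hello" spelled as explicit UTF-8 bytes, so the encoding of the
   source file itself does not matter: U+2660 encodes as E2 99 A0. */
static const char spade_hello[] = "\xE2\x99\xA0" "Hello";
```

With these definitions, strlen(spade_hello) is 8 while utf8_code_points(spade_hello) is 6, and spade_hello[3] is 'H' because the first three bytes all belong to the spade.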

You're probably running on Linux with your terminal set to UTF-8, so scanf reads UTF-8 bytes and printf can write them back out. UTF-8 is designed so that a char[] can store it. I explicitly say char[] and not char because a non-ASCII character needs more than one byte.
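The "more than one byte" point can be sketched with a small encoder. This is an illustration under stated assumptions, not code from the question: utf8_encode is a hypothetical helper name, and it only follows the standard UTF-8 bit layout (1 to 4 bytes per code point).

```c
#include <stddef.h>

/* Hypothetical helper: writes the UTF-8 encoding of code point cp into
   out (room for 4 bytes) and returns the number of bytes written, or 0
   if cp is a surrogate or lies beyond U+10FFFF. */
size_t utf8_encode(unsigned long cp, unsigned char out[4])
{
    if (cp < 0x80) {                      /* ASCII: one byte, unchanged */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                     /* two bytes */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                   /* three bytes */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                     /* surrogates are not encodable */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                 /* four bytes */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

An ASCII letter such as 'H' comes out as one byte, U+2660 (♠) as three, and an emoji such as U+1F600 as four. None of the multi-byte sequences contains a zero byte, which is why a null-terminated char[] can hold them.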

Your program has undefined behavior.

scanf("%s", input);

expects a char * pointing to a buffer of char, but

char *input[100];

input is an array of 100 pointers to char. When passed to scanf it decays to a pointer to pointer to char (char **), not the char * that %s expects.

Your program may appear to work because the storage you pass to scanf (100 pointers, typically 400 or 800 bytes) is large enough to hold the Unicode input, and the characters you pass have no null byte in between them. But it may just as well fail, because the C implementation on your machine (or any other) is allowed to do anything once undefined behavior occurs.
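For comparison, a well-defined version of the read might look like the sketch below. It is only one possible fix: read_word is a hypothetical helper name, and the 100-byte buffer size is carried over from the question.

```c
#include <stdio.h>

/* Hypothetical helper: reads one whitespace-delimited word from `in`
   into a 100-byte char buffer. The field width 99 leaves room for the
   terminating '\0', so long input cannot overflow the buffer.
   Returns fscanf's result: 1 if a word was stored, EOF on end of input
   or read error. */
int read_word(FILE *in, char buf[100])
{
    return fscanf(in, "%99s", buf);
}
```

In the question's main, this would mean declaring char input[100]; (an array of char, not of char *), calling read_word(stdin, input), and printing input with printf("%s", input) only when the call returned 1. The bytes are still passed through untouched, so UTF-8 input round-trips exactly as before.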
