
Strange behavior of std::string with unicode

I have the following piece of code:

#include <iostream>

std::string eps("ε");

int main()
{
    std::cout << eps << '\n';
    return 0;
}

Somehow it compiles with g++ and clang on Ubuntu, and even prints the right character, ε. I also have an almost identical piece of code that happily reads ε with cin into a std::string. By the way, eps.size() is 2.

My question is: how does that work? How can we insert a Unicode character into a std::string? My guess is that the operating system handles all this Unicode work, but I'm not sure.

EDIT

As for output, I understand that it is the terminal that is responsible for showing me the right character (ε in this case).

But with input: cin reads symbols up to ' ' or any other whitespace character (and, as I understand it, byte by byte). So if I take Ƞ, whose second byte is 32 (' '), it should read only the first byte and then stop. But it reads Ƞ. How?

The most likely reason is that everything is encoded in UTF-8, as it is on my system:

$ xxd test.cpp
...
0000020: 2065 7073 2822 ceb5 2229 3b0a 0a69 6e74   eps("..");..int
                        ^^^^ ε in UTF-8                 ^^ TWO bytes!
...
$ g++ -o test.out test.cpp
$ ./test.out 
ε
$ ./test.out | xxd
0000000: ceb5 0a
         ^^^^              
