Strange behavior of std::string with Unicode

I have the following piece of code:

#include <iostream>
#include <string>

std::string eps("ε");

int main()
{
    std::cout << eps << '\n';
    return 0;
}

Somehow it compiles with both g++ and clang on Ubuntu, and it even prints the right character, ε. I also have a nearly identical piece of code that happily reads ε from cin into a std::string. By the way, eps.size() is 2.

My question is: how does this work? How can a Unicode character be stored in a std::string? My guess is that the operating system handles all the Unicode work, but I'm not sure.

EDIT

As for output, I now understand that the terminal is responsible for showing me the right character (ε in this case).

But what about input? cin reads characters up to ' ' or any other whitespace character (byte by byte, as I understand it). So if I take Ƞ, whose second byte is 32 (' '), it should read only the first byte and then stop. But it reads the whole Ƞ. How?

The most likely reason is that everything (the source file, the string contents, and the terminal) is encoded in UTF-8, as it is on my system:

$ xxd test.cpp
...
0000020: 2065 7073 2822 ceb5 2229 3b0a 0a69 6e74   eps("..");..int
                        ^^^^ ε in UTF-8                 ^^ TWO bytes!
...
$ g++ -o test.out test.cpp
$ ./test.out 
ε
$ ./test.out | xxd
0000000: ceb5 0a
         ^^^^              
