C++ output Unicode in variable

I'm trying to output a string containing Unicode characters, which is received from a curl call. Therefore, I'm looking for something similar to the u8 and L prefixes for string literals, but applicable to variables. E.g.:

const char *s  = u8"\u0444";

However, I have a string variable containing Unicode characters, such as:

mit freundlichen Grüßen

When I want to print this string with:

cout << UnicodeString << endl;

it outputs:

mit freundlichen Gr??en

When I use wcout, it prints:

mit freundlichen Gren

What am I doing wrong, and how can I achieve the correct output? I return the output with RapidJSON, which returns the string as:

mit freundlichen Gr��en

Important to note: the application is a CGI running on Ubuntu, replying to browser requests.

On my system the following produces the correct output. Try it on your system. I am confident that it will produce similar results.

#include <string>
#include <iostream>
using namespace std;

int main()
{
    string s="mit freundlichen Grüßen";
    cout << s << endl;
    return 0;
}

If that works, then this points to the web transfer not being 8-bit clean.

Mike.

containing unicode characters

You didn't specify which Unicode encoding the string uses. There is the "narrow" UTF-8, which can be stored in a std::string and printed using std::cout, as well as wider variants, which can't. It is crucial to know which encoding you're dealing with. For the remainder of my answer, I'm going to assume you want to use UTF-8.


When I want to print this string with:

 cout << UnicodeString << endl; 

EDIT:

Important to note: the application is a CGI running on Ubuntu, replying to browser requests.

The concerns here are slightly different from printing onto a terminal.

  1. You need to set the Content-Type response header appropriately, or else the client cannot know how to interpret the response. For example: Content-Type: application/json; charset=utf-8
  2. You still need to make sure that the source string is in fact in the encoding that the header declares. See the old answer below for an overview.
  3. The browser has to support the encoding. Most modern browsers have supported UTF-8 for a long time now.
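
As a rough sketch of point 1, the CGI program could emit the header, then an empty line, then the body. The function name buildCgiResponse is made up for illustration, and the sketch assumes the body already holds valid UTF-8 bytes; it does not check or convert anything.

```cpp
#include <string>

// Hypothetical helper: builds a minimal CGI response whose header tells the
// browser that the body is UTF-8 encoded JSON.
// Assumption: utf8Body already contains valid UTF-8 bytes.
std::string buildCgiResponse(const std::string& utf8Body)
{
    std::string response;
    response += "Content-Type: application/json; charset=utf-8\r\n";
    response += "\r\n";   // an empty line separates the headers from the body
    response += utf8Body;
    return response;
}
```

The CGI's main function would then write the result to standard output, for example with std::cout << buildCgiResponse(body).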

Answer regarding printing to terminal:

Assuming that

  1. UnicodeString indeed contains a UTF-8 encoded string
  2. and that the terminal uses UTF-8 encoding
  3. and the font that the terminal uses has the graphemes that you use

the above should work.

it outputs:

 mit freundlichen Gr??en 

Then it appears that at least one of the above assumptions doesn't hold.

Whether 1. is true, you can verify by inspecting the numeric value of each code unit separately and comparing it to what you would expect of UTF-8. If 1. isn't true, then you need to figure out what encoding the string actually uses, and either convert the encoding or configure the terminal to use that encoding.
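
One way to inspect the code units is to dump each byte of the string as hex. This is only a sketch; the helper name toHex is made up for illustration.

```cpp
#include <string>

// Hypothetical helper: returns the bytes (code units) of s as
// space-separated, two-digit lowercase hex values.
std::string toHex(const std::string& s)
{
    static const char digits[] = "0123456789abcdef";
    std::string out;
    for (unsigned char byte : s) {
        if (!out.empty()) {
            out += ' ';
        }
        out += digits[byte >> 4];    // high nibble
        out += digits[byte & 0x0F];  // low nibble
    }
    return out;
}
```

In UTF-8, ü is the two bytes c3 bc and ß is c3 9f. If the dump shows the single bytes fc and df in their place, the string is more likely ISO-8859-1 or a similar single-byte encoding.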

  2. The terminal typically, but not necessarily, uses the system's native encoding. The first step in figuring out what encoding your terminal / system uses is to figure out what terminal / system you are using in the first place. The details are probably in a manual.

    If the terminal doesn't use UTF-8, then you need to convert the UTF-8 string within your program into the character encoding that the terminal does use - unless that encoding doesn't have the graphemes that you want to print. Unfortunately, the standard library doesn't provide arbitrary character encoding conversion support (there is some support for converting between narrow and wide Unicode, but even that support is deprecated). You could implement the conversion yourself from the Unicode standard, although I would like to point out that using an existing conversion implementation can save a lot of work.

    If the character encoding of the terminal doesn't have the needed graphemes - or if you don't want to implement encoding conversion - the alternative is to reconfigure the terminal to use UTF-8. If the terminal / system can be configured to use UTF-8, there should be details in the manual.

  3. You should be able to test whether the font itself has the required graphemes simply by typing the characters into the terminal and seeing if they show as they should - although this test will also fail if the terminal encoding does not have the graphemes, so check that first. The manual of your terminal should explain how to change the font, should it be necessary. That said, I would expect üß to exist in most fonts.
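
Writing a conversion by hand is only practical for very simple encodings. As a minimal sketch, assume the bytes received from the web turn out to be ISO-8859-1 (this is an assumption; the question does not say which encoding curl delivered). They could then be converted to UTF-8 as follows. The function name latin1ToUtf8 is made up for illustration, and any other encoding calls for an existing conversion library instead.

```cpp
#include <string>

// Hypothetical helper: converts ISO-8859-1 (Latin-1) bytes to UTF-8.
// Assumption: the input really is Latin-1. Every Latin-1 byte value equals
// its Unicode code point, so each one needs at most two UTF-8 bytes.
std::string latin1ToUtf8(const std::string& latin1)
{
    std::string utf8;
    utf8.reserve(latin1.size() * 2);
    for (unsigned char c : latin1) {
        if (c < 0x80) {
            utf8 += static_cast<char>(c);                  // ASCII is unchanged
        } else {
            utf8 += static_cast<char>(0xC0 | (c >> 6));    // lead byte 110000xx
            utf8 += static_cast<char>(0x80 | (c & 0x3F));  // continuation 10xxxxxx
        }
    }
    return utf8;
}
```

With this, the Latin-1 byte fc (ü) becomes c3 bc and df (ß) becomes c3 9f.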

If you are on Windows, what I would suggest is using Unicode UTF-16 at the Windows boundary.

It seems to me that on Windows with Visual C++ (at least up to VS2015) std::cout cannot output UTF-8-encoded text, but std::wcout correctly outputs UTF-16-encoded text.

This compilable code snippet correctly outputs your string containing German characters:

#include <fcntl.h>   // _O_U16TEXT
#include <io.h>      // _setmode
#include <stdio.h>   // _fileno, stdout
#include <iostream>

int main()
{
    _setmode(_fileno(stdout), _O_U16TEXT);

    // ü : U+00FC
    // ß : U+00DF
    const wchar_t * text = L"mit freundlichen Gr\u00FC\u00DFen";

    std::wcout << text << L'\n';
}

Note the use of a UTF-16-encoded wchar_t string.

On a more general note, I would suggest using the UTF-8 encoding (for example, storing text in std::strings) in the cross-platform portions of your C++ code, and converting to UTF-16-encoded text at the Windows boundary.

To convert between UTF-8 and UTF-16 you can use Windows APIs like MultiByteToWideChar and WideCharToMultiByte. These are C APIs that can be safely and conveniently wrapped in C++ code (more details can be found in this MSDN article, and you can find compilable C++ code here on GitHub).
