How to work with non-ascii characters in strings in C++?

When writing a program, I'm having issues working with a combination of special characters and regular ones. When I print either type to the console separately, they work fine, but when I print a special and a normal character on the same line, I get garbled characters instead of the expected output. My code:

#include <fstream>
#include <iostream>
#include <string>

using namespace std;

void initCharacterMap(){
    const string normal = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz1234567890!@#$%^&*()-_[]{};':\",.<>/?";
    const string inverse = "∀𐐒Ↄ◖ƎℲ⅁HIſ⋊⅂WᴎOԀΌᴚS⊥∩ᴧMX⅄Zɐqɔpǝɟƃɥıɾʞʃɯuodbɹsʇnʌʍxʎz12Ɛᔭ59Ɫ860¡@#$%^⅋*)(-‾][}{؛,:„'˙></¿";

    cout << normal << endl;

    for(int i=0;i<normal.length();i++){
        cout << normal[i];
    }
    cout << endl;

    cout << inverse << endl;

    for(int i=0;i<inverse.length();i++){
        cout << inverse[i];
    }
    cout << endl;

    for(int i=0;i<inverse.length();i++){
        cout << normal[i] << inverse[i] << endl;
    }
}

int main() {
    initCharacterMap();
    return 0;
}

And the console output: https://paste.ubuntu.com/p/H9bqh67WPZ/

When viewed in the console, the \XX characters show up as the unknown-character symbol, and when I opened that log, I was warned that some characters couldn't be displayed and that editing could corrupt the file.

If anyone has any advice on how I can fix this, it would be greatly appreciated.

EDIT: After following the suggestion in Marek R's answer, the situation greatly improved, but this still isn't quite giving me the results I want. New code:

#include <fstream>
#include <iostream>
#include <string>

using namespace std;

void initCharacterMap(){
    const wchar_t normal[] = L"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz1234567890!@#$%^&*()-_[]{};':\",.<>/?";
    const wchar_t inverse[] = L"∀𐐒Ↄ◖ƎℲ⅁HIſ⋊⅂WᴎOԀΌᴚS⊥∩ᴧMX⅄Zɐqɔpǝɟƃɥıɾʞʃɯuodbɹsʇnʌʍxʎz12Ɛᔭ59Ɫ860¡@#$%^⅋*)(-‾][}{؛,:„'˙></¿";

    wcout << normal << endl;

    for(int i=0;i<sizeof(normal)/sizeof(normal[0]);i++){
        wcout << normal[i];
    }
    wcout << endl;

    wcout << inverse << endl;

    for(int i=0;i<sizeof(inverse)/sizeof(inverse[0]);i++){
        wcout << inverse[i];
    }
    wcout << endl;

    for(int i=0;i<sizeof(inverse)/sizeof(inverse[0]);i++){
        wcout << normal[i] << inverse[i] << endl;
    }
}

int main() {
    initCharacterMap();
    return 0;
}

New console output: https://paste.ubuntu.com/p/hcM7JB99zj/

So I no longer have trouble printing the contents of the two strings together, but now every non-ASCII character is replaced with a question mark in the output. Is there any way to make those characters print properly?

Most probably your code is using the UTF-8 encoding. This means that a single character can occupy from one to four bytes. Note that the value of inverse.size() is bigger than you are expecting.

std::string doesn't know anything about encoding, so it treats each byte as a character. The output console interprets the sequence of bytes according to its encoding and shows the proper characters.

When you print each string separately, byte by byte, it works because the byte sequence stays valid. When you print one byte from one string and then one byte from the other, the multi-byte sequences are broken up and things get messy.

The easiest way to fix it is to use std::wstring, wchar_t and L"some literal". It should work in your case, but as pointed out in the comments below, on some platforms some characters may not fit into a single wide character (for example, where wchar_t is 16 bits, as on Windows, 𐐒 needs two of them). If you want to know more, read about the different character encodings.

The other way to solve your problem is to use a map which transforms one sequence of bytes (a string) into another sequence of bytes (a string). C++11:

auto dictionary = std::unordered_map<std::string, std::string> {
    { "A", "∀" },
    { "B", "𐐒" },
    { "C", "Ↄ" },
    { "D", "◖" },
    … … …
}


Edit: I've tested your new code, and you should add code which configures the locale for the output stream.

On my Mac (with a Polish locale), when building with clang, the application ignores the inverted values (wcout goes into an invalid state), but when the locale is set everything works as you expect.

#include <fstream>
#include <iostream>
#include <string>
#include <locale>

using namespace std;

void initCharacterMap(){
    wcout.imbue(locale(""));

    const auto normal = L"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz1234567890!@#$%^&*()-_[]{};':\",.<>/?"s;
    const auto inverse = L"∀𐐒Ↄ◖ƎℲ⅁HIſ⋊⅂WᴎOԀΌᴚS⊥∩ᴧMX⅄Zɐqɔpǝɟƃɥıɾʞʃɯuodbɹsʇnʌʍxʎz12Ɛᔭ59Ɫ860¡@#$%^⅋*)(-‾][}{؛,:„'˙></¿"s;

    wcout << normal << endl;

    for(auto ch : normal){
        wcout << ch;
    }
    wcout << endl;

    wcout << inverse << endl;

    for(auto ch : inverse){
        wcout << ch;
    }
    wcout << endl;

    for(size_t i=0; i < inverse.length(); ++i){
        wcout << normal[i] << inverse[i] << endl;
    }
}

int main() {
    initCharacterMap();
    return 0;
}

https://wandbox.org/permlink/nTYi5RbZgZXclE5r

I suspect that the standard library in your compiler also doesn't know how to perform the conversion with the default locale, so it prints question marks instead of the actual characters. So add these two lines (the include and the imbue) and it should work. If not, then provide information about your platform and compiler.

The technical posts on this site are licensed under CC BY-SA 4.0. If you reprint them, please indicate the site URL or the original address. For any questions, please contact: yoyou2525@163.com.
