
How to convert a text like “\320\272\320\276\320\274…” to std::wstring in C++?

I am working on code that processes messages from Ubuntu. Some of the messages contain, for example:

localhost sshd 1658 - - Invalid user \320\272\320\276\320\274\320\274\321\320\275\320\270\320\267\320\274 from 172.28.60.28 port 50712 ]

where "\320\272\320\276\320\274\320\274\321\320\275\320\270\320\267\320\274" is a user name that was originally in Russian. How can I convert it to std::wstring?

The numbers after the backslashes are the UTF-8 byte sequence values of the Cyrillic letters, each byte represented as an octal number.
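As a worked example of that statement: the first pair, \320\272, is octal for the bytes 0xD0 and 0xBA, which together form the two-byte UTF-8 encoding of U+043A (the Cyrillic letter к). The following is only a minimal sketch; the helper name decodeTwoByteUtf8 is made up for illustration and it handles nothing but two-byte sequences.

```cpp
#include <cstdint>

// Hypothetical helper (not from the original answer): combines the two bytes
// of a two-byte UTF-8 sequence (110xxxxx 10yyyyyy) into the code point xxxxxyyyyyy.
constexpr std::uint32_t decodeTwoByteUtf8(unsigned char lead, unsigned char trail)
{
    return (static_cast<std::uint32_t>(lead & 0x1F) << 6) | (trail & 0x3F);
}

// 0320 and 0272 are C++ octal literals, i.e. the byte values 0xD0 and 0xBA.
static_assert(0320 == 0xD0 && 0272 == 0xBA, "octal escapes are plain byte values");
// Together they encode U+043A, the Cyrillic small letter ka.
static_assert(decodeTwoByteUtf8(0320, 0272) == 0x043A, "\\320\\272 decodes to U+043A");
```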

You could, for example, use a regex replace to substitute each \ooo escape with the byte it represents, so that you get a real UTF-8 string out:


#include <cstdlib>   // EXIT_SUCCESS
#include <iostream>
#include <string>
#include <boost/regex.hpp>

int main()
{
    std::string const source = R"(Invalid user \320\272\320\276\320\274\320\274\321\320\275\320\270\320\267\320\274 from 172.28.60.28 port 50712)";
    // A backslash followed by exactly three octal digits (one escaped byte).
    boost::regex const re(R"(\\[0-7]{3})");

    // Formatter: writes the single byte that the matched escape stands for.
    auto const replacer = [](boost::smatch const& match, auto it) {
        // Skip the leading backslash and parse the three digits as base 8.
        auto const byteVal = std::stoi(match[0].str().substr(1), nullptr, 8);
        *it = static_cast<char>(byteVal);
        return ++it;
    };
    std::string const out = boost::regex_replace(source, re, replacer);

    std::cout << out << std::endl;
    return EXIT_SUCCESS;
}

If you really need to, you can then convert this std::string to std::wstring using, e.g., Thomas's method.
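If pulling in Boost is not an option, the same replacement can be written by hand with the standard library only. This is a minimal sketch rather than part of the original answer: the function name decodeOctalEscapes is hypothetical, and it assumes every escape is a backslash followed by exactly three octal digits with a value that fits in one byte.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: replaces every "\ooo" escape (backslash plus three
// octal digits, value <= 0377) with the single byte it denotes and copies
// all other characters unchanged.
std::string decodeOctalEscapes(const std::string& in)
{
    auto isOctal = [](char c) { return c >= '0' && c <= '7'; };

    std::string out;
    out.reserve(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        if (in[i] == '\\' && i + 3 < in.size()
            && in[i + 1] >= '0' && in[i + 1] <= '3'   // first digit 0-3 keeps the value within one byte
            && isOctal(in[i + 2]) && isOctal(in[i + 3])) {
            int const value = (in[i + 1] - '0') * 64
                            + (in[i + 2] - '0') * 8
                            + (in[i + 3] - '0');
            out.push_back(static_cast<char>(value));
            i += 3;  // skip the three digits just consumed
        } else {
            out.push_back(in[i]);  // not an escape: copy verbatim
        }
    }
    return out;
}
```

Anything that does not look like a complete escape, such as a lone backslash or a backslash followed by fewer than three octal digits, is copied through unchanged.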

If you have a std::string containing UTF-8 encoded text and you wish to convert it to std::wstring, you can do so using the std::codecvt_utf8 facet and the std::wstring_convert class template:

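#include <codecvt>  // std::codecvt_utf8
#include <locale>   // std::wstring_convert
#include <string>

// Converts a UTF-8 encoded std::string to std::wstring.
// from_bytes throws std::range_error if the input cannot be converted.
std::wstring convert(const std::string& utf8String) {
    std::wstring_convert<std::codecvt_utf8<wchar_t>> converter{};
    return converter.from_bytes(utf8String);
}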

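The encoding of the resulting std::wstring depends on the size of wchar_t: it is UCS-2 on Windows platforms, where wchar_t is 16 bits (so characters outside the Basic Multilingual Plane cannot be converted), and UCS-4 on most non-Windows platforms, where wchar_t is 32 bits.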

Note that std::codecvt_utf8 and std::wstring_convert are deprecated as of C++17, and users are instead encouraged to rely on specialized Unicode/text-processing libraries. For now, though, this should suffice.
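If you would rather avoid the deprecated facilities and cannot add a library, the decoding can also be done by hand. The following is only a sketch under stated assumptions, not a substitute for a proper Unicode library: the name utf8ToWide is hypothetical, and for brevity the function does not reject overlong encodings, surrogate code points, or values above U+10FFFF.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical helper: decodes UTF-8 into std::wstring without <codecvt>.
// Throws std::runtime_error on truncated or malformed sequences.
std::wstring utf8ToWide(const std::string& utf8)
{
    std::wstring out;
    for (std::size_t i = 0; i < utf8.size();) {
        unsigned char const lead = static_cast<unsigned char>(utf8[i]);
        std::uint32_t cp;   // code point being assembled
        std::size_t extra;  // number of continuation bytes expected
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else throw std::runtime_error("invalid UTF-8 lead byte");

        if (extra >= utf8.size() - i)
            throw std::runtime_error("truncated UTF-8 sequence");
        for (std::size_t k = 1; k <= extra; ++k) {
            unsigned char const c = static_cast<unsigned char>(utf8[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::runtime_error("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);  // append the 6 payload bits
        }
        i += extra + 1;

        if (sizeof(wchar_t) >= 4 || cp < 0x10000) {
            out.push_back(static_cast<wchar_t>(cp));
        } else {
            // 16-bit wchar_t (e.g. Windows): emit a UTF-16 surrogate pair.
            cp -= 0x10000;
            out.push_back(static_cast<wchar_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<wchar_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}
```

Unlike std::codecvt_utf8<wchar_t>, this sketch produces UTF-16 surrogate pairs when wchar_t is 16 bits wide, so characters outside the Basic Multilingual Plane survive on Windows as well.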

