Convert std::string to Unicode in Linux

EDIT: I modified the question after realizing it was wrong to begin with.

I'm porting part of a C# application to Linux, where I need to get the bytes of a UTF-16 string:

string myString = "ABC";
byte[] bytes = Encoding.Unicode.GetBytes(myString);

So that the bytes array is now:

"65 00 66 00 67 00" (bytes)

How can I achieve the same in C++ on Linux? I have a myString defined as std::string, and it seems that wchar_t, and therefore each std::wstring character, is 4 bytes on Linux?

Your question isn't really clear, but I'll try to clear up some confusion.

Introduction

This is the state of character set handling in C (inherited by C++) after the '95 amendment to the C standard:

  • the character set used is given by the current locale

  • wchar_t is meant to store code points

  • char is meant to store a multibyte encoded form (one constraint, for instance, is that characters in the basic character set must be encoded in one byte)

  • string literals are encoded in an implementation-defined manner. If they use characters outside of the basic character set, you can't assume they are valid in all locales.

Thus with a 16-bit wchar_t you are restricted to the BMP. Using the surrogates of UTF-16 is not compliant, but I think MS and IBM are more or less forced to do this because they believed Unicode when it said it would forever be a 16-bit charset. Those who delayed their Unicode support tend to use a 32-bit wchar_t.

Newer standards don't change much. Mostly, there are literals for UTF-8, UTF-16 and UTF-32 encoded strings, and there are types for 16-bit and 32-bit characters. There is little or no additional support for Unicode in the standard libraries.
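To make the newer literals and types concrete, here is a minimal sketch, assuming a C++11 compiler. The helper name utf16le_bytes is made up for illustration. It takes a std::u16string (for example one built from a u"..." literal) and writes each 16-bit code unit low byte first, which is the layout C#'s Encoding.Unicode produces.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper: serialize UTF-16 code units as little-endian bytes,
// the byte order used by C#'s Encoding.Unicode.
std::vector<std::uint8_t> utf16le_bytes(std::u16string const& s)
{
    std::vector<std::uint8_t> bytes;
    bytes.reserve(2 * s.size());
    for (char16_t unit : s) {
        bytes.push_back(static_cast<std::uint8_t>(unit & 0xFF));        // low byte first
        bytes.push_back(static_cast<std::uint8_t>((unit >> 8) & 0xFF)); // then high byte
    }
    return bytes;
}
```

Calling utf16le_bytes(u"ABC") gives the six bytes 65 0 66 0 67 0 from the question. Doing the split by hand rather than copying memory keeps the result independent of the machine's own byte order.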

How to convert from one encoding to the other

You have to be in a locale which uses Unicode. Hopefully

std::locale::global(std::locale(""));

will be enough for that. If not, your environment is not properly set up (or it is set up for another charset, and assuming Unicode would not be a service to your user).
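One thing to watch for: std::locale("") throws std::runtime_error when the environment names a locale the system doesn't have. A minimal sketch of handling that, with a made-up helper name (use_environment_locale), could look like this:

```cpp
#include <locale>
#include <stdexcept>

// Hypothetical helper: install the user's preferred locale as the global one.
// Returns false if the environment names a locale the system cannot provide.
bool use_environment_locale()
{
    try {
        std::locale::global(std::locale(""));
        return true;
    } catch (std::runtime_error const&) {
        // The previous global locale stays in effect.
        return false;
    }
}
```

When the new locale has a name, std::locale::global also calls std::setlocale, so the C functions pick up the change as well. Whether the resulting locale actually uses a Unicode encoding still depends on the user's environment.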

C Style

Use the wcstombs and mbstowcs functions. Here is an example for what you asked.

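#include <cassert>
#include <cstdlib>
#include <string>
#include <vector>

std::string narrow(std::wstring const& s)
{
    // Assumes at most 4 bytes per wide character (true for UTF-8), plus the NUL.
    std::vector<char> result(4*s.size() + 1);
    // wcstombs expects a NUL-terminated source, hence c_str().
    std::size_t used = std::wcstombs(&result[0], s.c_str(), result.size());
    assert(used < result.size());  // also fails on the (size_t)-1 error return
    return result.data();
}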
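For the other direction, a sketch along the same lines with mbstowcs might look like the following. The name widen_c is made up, and the function throws instead of asserting so that a bad input is reported in release builds too. Like narrow above, it depends on the current C locale.

```cpp
#include <cstdlib>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical counterpart to narrow(): multibyte (locale-encoded) to wide.
std::wstring widen_c(std::string const& s)
{
    // Every wide character consumes at least one input byte, so s.size()
    // wide characters plus the terminating NUL is always enough room.
    std::vector<wchar_t> result(s.size() + 1);
    std::size_t used = std::mbstowcs(&result[0], s.c_str(), result.size());
    if (used == static_cast<std::size_t>(-1)) {
        throw std::runtime_error("invalid multibyte sequence for the current locale");
    }
    return std::wstring(&result[0], used);
}
```

Note that both C functions stop at the first NUL, so strings with embedded NUL characters are cut short.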

C++ Style

The codecvt facet of the locale provides the needed functionality. The advantage is that you don't have to change the global locale to use it. The drawback is that the usage is more complex.

#include <locale>
#include <iostream>
#include <string>
#include <vector>
#include <assert.h>
#include <iomanip>

std::string narrow(std::wstring const& s,
                   std::locale loc = std::locale())
{
    std::vector<char> result(4*s.size() + 1);
    wchar_t const* fromNext;
    char* toNext;
    mbstate_t state = {0};
    std::codecvt_base::result convResult
        = std::use_facet<std::codecvt<wchar_t, char, std::mbstate_t> >(loc)
        .out(state, &s[0], &s[s.size()], fromNext,
             &result[0], &result[0] + result.size(), toNext);

    assert(fromNext == &s[s.size()]);
    assert(toNext != &result[0] + result.size());
    assert(convResult == std::codecvt_base::ok);
    *toNext = '\0';

    return &result[0];
}

std::wstring widen(std::string const& s,
                   std::locale loc = std::locale())
{
    std::vector<wchar_t> result(s.size() + 1);
    char const* fromNext;
    wchar_t* toNext;
    mbstate_t state = {0};
    std::codecvt_base::result convResult
        = std::use_facet<std::codecvt<wchar_t, char, std::mbstate_t> >(loc)
        .in(state, &s[0], &s[s.size()], fromNext,
            &result[0], &result[0] + result.size(), toNext);

    assert(fromNext == &s[s.size()]);
    assert(toNext != &result[0] + result.size());
    assert(convResult == std::codecvt_base::ok);
    *toNext = L'\0';

    return &result[0];
}

You should replace the assertions with proper error handling.

BTW, this is standard C++ and doesn't assume Unicode, except for the computation of the size of result. You can do better by checking convResult, which can indicate a partial conversion.
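As a rough idea of what that handling could look like, here is a sketch of widen that checks the conversion result and throws. The name widen_checked and the choice of std::runtime_error are illustrative assumptions, not the only reasonable design.

```cpp
#include <cwchar>
#include <locale>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical variant of widen() that reports failures with exceptions
// instead of assertions.
std::wstring widen_checked(std::string const& s,
                           std::locale const& loc = std::locale())
{
    if (s.empty()) return std::wstring();

    typedef std::codecvt<wchar_t, char, std::mbstate_t> Facet;
    Facet const& facet = std::use_facet<Facet>(loc);

    std::vector<wchar_t> result(s.size());  // at most one wide char per input byte
    char const* fromBegin = s.data();
    char const* fromEnd = fromBegin + s.size();
    char const* fromNext = fromBegin;
    wchar_t* toNext = &result[0];
    std::mbstate_t state = std::mbstate_t();

    std::codecvt_base::result r =
        facet.in(state, fromBegin, fromEnd, fromNext,
                 &result[0], &result[0] + result.size(), toNext);

    if (r == std::codecvt_base::error)
        throw std::runtime_error("input is not valid in the locale's encoding");
    if (r != std::codecvt_base::ok || fromNext != fromEnd)
        throw std::runtime_error("incomplete conversion (truncated input or output buffer too small)");

    return std::wstring(&result[0], toNext);
}
```

Building the result from the [&result[0], toNext) range avoids relying on a terminating NUL, so the output buffer doesn't need the extra slot.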

The simplest way is to grab a small library, such as UTF8 CPP, and do something like:

utf8::utf8to16(line.begin(), line.end(), back_inserter(utf16line));
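If adding a dependency isn't an option, the conversion is small enough to sketch by hand. The function below is not UTF8 CPP's implementation. It is a made-up helper (utf8_to_utf16) that assumes the std::string holds UTF-8, decodes one code point at a time, and emits a surrogate pair for anything above the BMP.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical helper, not part of UTF8 CPP: decode UTF-8, re-encode as UTF-16.
std::u16string utf8_to_utf16(std::string const& in)
{
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        unsigned char lead = static_cast<unsigned char>(in[i]);
        std::uint32_t cp;
        std::size_t extra;  // number of continuation bytes that follow
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (extra > in.size() - i - 1)
            throw std::invalid_argument("truncated UTF-8 sequence");
        for (std::size_t k = 1; k <= extra; ++k) {
            unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }
        i += extra + 1;

        // Reject overlong forms, surrogate code points and values past U+10FFFF.
        static std::uint32_t const minimum[] = { 0, 0x80, 0x800, 0x10000 };
        if (cp < minimum[extra] || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            throw std::invalid_argument("invalid code point in UTF-8 input");

        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {
            cp -= 0x10000;  // split into a surrogate pair
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}
```

To get the same bytes as the C# snippet, write each resulting code unit low byte first. A tested library is still the safer choice for production code.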

I usually use the UnicodeConverter class from the Poco C++ libraries. If you don't want the dependency, you can have a look at the code.

