Compare std::wstring and std::string

How can I compare a wstring, such as L"Hello", to a string? If they need to have the same type, how can I convert them into the same type?

Since you asked, here are my standard conversion functions from string to wide string, implemented using the C++ std::string and std::wstring classes.

First off, make sure to start your program with std::setlocale:

#include <clocale>

int main()
{
  std::setlocale(LC_CTYPE, "");  // before any string operations
}

Now for the functions. First off, getting a wide string from a narrow string:

#include <string>
#include <vector>
#include <iostream>  // std::cout, used for error reporting below
#include <cassert>
#include <cstdlib>
#include <cwchar>
#include <cerrno>

// Dummy overload
std::wstring get_wstring(const std::wstring & s)
{
  return s;
}

// Real worker
std::wstring get_wstring(const std::string & s)
{
  const char * cs = s.c_str();
  const size_t wn = std::mbsrtowcs(NULL, &cs, 0, NULL);

  if (wn == size_t(-1))
  {
    std::cout << "Error in mbsrtowcs(): " << errno << std::endl;
    return L"";
  }

  std::vector<wchar_t> buf(wn + 1);
  const size_t wn_again = std::mbsrtowcs(buf.data(), &cs, wn + 1, NULL);

  if (wn_again == size_t(-1))
  {
    std::cout << "Error in mbsrtowcs(): " << errno << std::endl;
    return L"";
  }

  assert(cs == NULL); // successful conversion

  return std::wstring(buf.data(), wn);
}

And going back, making a narrow string from a wide string. I call the narrow string a "locale string" because its encoding is platform-dependent and determined by the current locale:

// Dummy
std::string get_locale_string(const std::string & s)
{
  return s;
}

// Real worker
std::string get_locale_string(const std::wstring & s)
{
  const wchar_t * cs = s.c_str();
  const size_t wn = std::wcsrtombs(NULL, &cs, 0, NULL);

  if (wn == size_t(-1))
  {
    std::cout << "Error in wcsrtombs(): " << errno << std::endl;
    return "";
  }

  std::vector<char> buf(wn + 1);
  const size_t wn_again = std::wcsrtombs(buf.data(), &cs, wn + 1, NULL);

  if (wn_again == size_t(-1))
  {
    std::cout << "Error in wcsrtombs(): " << errno << std::endl;
    return "";
  }

  assert(cs == NULL); // successful conversion

  return std::string(buf.data(), wn);
}

Some notes:

  • If you don't have std::vector::data(), you can say &buf[0] instead.
  • I've found that the r-style conversion functions mbsrtowcs and wcsrtombs don't work properly on Windows. There, you can use mbstowcs and wcstombs instead: mbstowcs(buf.data(), cs, wn + 1); and wcstombs(buf.data(), cs, wn + 1);


In response to your question, if you want to compare two strings, you can convert both of them to wide strings and then compare those. If you are reading a file from disk which has a known encoding, you should use iconv() to convert the file from that encoding to wide characters (wchar_t) and then compare the result with the wide string.
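To make the "convert, then compare" idea concrete, here is a minimal sketch. The names widen and equal_as_wide are made up for this illustration and are not part of the functions above. It assumes the narrow string is encoded in the current LC_CTYPE locale, and it compares code units only, with no normalization.

```cpp
#include <cassert>
#include <cwchar>
#include <string>
#include <vector>

// Hypothetical helper: decodes `narrow` using the current LC_CTYPE locale
// and stores the result in `out`. Returns false if the input is not valid
// in that locale's encoding. Conversion stops at the first null character.
bool widen(const std::string& narrow, std::wstring& out)
{
    std::mbstate_t state = std::mbstate_t();
    const char* src = narrow.c_str();

    // With a null destination, mbsrtowcs only reports the required length.
    const std::size_t len = std::mbsrtowcs(nullptr, &src, 0, &state);
    if (len == static_cast<std::size_t>(-1))
        return false;

    std::vector<wchar_t> buf(len + 1);  // +1 for the terminating null
    state = std::mbstate_t();
    src = narrow.c_str();
    if (std::mbsrtowcs(buf.data(), &src, buf.size(), &state)
            == static_cast<std::size_t>(-1))
        return false;

    out.assign(buf.data(), len);
    return true;
}

// Hypothetical helper: true when the widened narrow string is equal,
// code unit for code unit, to `wide`.
bool equal_as_wide(const std::string& narrow, const std::wstring& wide)
{
    std::wstring widened;
    return widen(narrow, widened) && widened == wide;
}
```

With this, equal_as_wide("Hello", L"Hello") is true for plain ASCII input in any locale; for non-ASCII text the result depends on the locale selected with std::setlocale.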

Beware, though, that complex Unicode text may have multiple different representations as code point sequences which you may want to consider equal. If that is a possibility, you need to use a higher-level Unicode processing library (such as ICU) and normalize your strings to some common, comparable form.
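As a small illustration of that caveat, the sketch below builds the same visible text, "é", in two ways and shows that a plain std::wstring comparison treats them as different. The function names are made up for this example; actually normalizing the strings would need a library such as ICU, which is not shown here.

```cpp
#include <cassert>
#include <string>

// Precomposed form: the single code point U+00E9
// (LATIN SMALL LETTER E WITH ACUTE).
std::wstring precomposed()
{
    return L"\u00E9";
}

// Decomposed form: 'e' followed by U+0301 (COMBINING ACUTE ACCENT).
std::wstring decomposed()
{
    return L"e\u0301";
}

// Plain comparison of the stored code units; performs no normalization.
bool naive_equal(const std::wstring& a, const std::wstring& b)
{
    return a == b;
}
```

Here naive_equal(precomposed(), decomposed()) returns false even though both strings render as the same character, which is exactly the case where normalization is needed before comparing.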

You should convert the char string to a wchar_t string using mbstowcs, and then compare the resulting strings. Notice that mbstowcs works on char * / wchar_t *, so you'll probably need to do something like this:

#include <cstdlib>
#include <string>

std::wstring StringToWstring(const std::string & source)
{
    // Assumes at most one wchar_t per input char, plus the terminator.
    std::wstring target(source.size()+1, L' ');
    std::size_t newLength=std::mbstowcs(&target[0], source.c_str(), target.size());
    if (newLength == static_cast<std::size_t>(-1))
        return std::wstring();  // invalid multibyte sequence in source
    target.resize(newLength);
    return target;
}

This use of &target[0] relies on std::wstring storing its characters contiguously, which the standard only guarantees from C++11 on; for older compilers I'm not sure it is strictly conforming, so if someone has a good answer to that, please tell me in the comments. Also, there's an implicit assumption that the converted string won't be longer (in number of wchar_t) than the number of chars in the original string. It's a logical assumption, but I'm still not sure it's covered by the standard.

On the other hand, the C standard does not seem to guarantee a way to ask mbstowcs for the size of the needed buffer, so either you go this way, or you use the (better done and better defined) code from Unicode libraries, be it the Windows APIs or libraries like iconv.

Still, keep in mind that comparing Unicode strings without using special functions is slippery ground: two equivalent strings may be evaluated as different when compared bitwise.

Long story short: this should work, and I think it's the most you can do with just the standard library, but how Unicode is handled is very implementation-dependent, and I wouldn't trust it much. In general, it's better to stick with one encoding inside your application and avoid this kind of conversion unless absolutely necessary, and, if you are working with definite encodings, to use APIs that are less implementation-dependent.

Think twice before doing this: you might not want to compare them in the first place. If you are sure you do and you are using Windows, then convert the string to a wstring with MultiByteToWideChar, then compare with CompareStringEx.

If you are not using Windows, then the analogous functions are mbstowcs and wcscmp. The standard wide character C++ functions are often not portable to Windows; for instance, MSVC flags mbstowcs as deprecated in favor of mbstowcs_s.
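For the non-Windows route, a rough sketch of the mbstowcs plus wcscmp combination on C-style strings might look like the following. The name same_text is invented for this example, and the buffer size rests on the assumption that the conversion never yields more wide characters than there are input bytes, which holds for the usual encodings but is not something I can point to in the standard.

```cpp
#include <cassert>
#include <cstdlib>
#include <cstring>
#include <cwchar>
#include <vector>

// Hypothetical helper: converts `narrow` using the current LC_CTYPE locale
// and compares the result with `wide` using wcscmp.
bool same_text(const char* narrow, const wchar_t* wide)
{
    // Assumed upper bound: one wide character per input byte, plus the
    // terminating null. The vector is zero-initialized.
    std::vector<wchar_t> buf(std::strlen(narrow) + 1);

    const std::size_t n = std::mbstowcs(buf.data(), narrow, buf.size());
    if (n == static_cast<std::size_t>(-1))
        return false;               // invalid multibyte sequence

    buf[buf.size() - 1] = L'\0';    // defensive: guarantee termination
    return std::wcscmp(buf.data(), wide) == 0;
}
```

Like any wcscmp-based check, this compares code units only, so the normalization caveat still applies.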

The cross-platform way to work with Unicode is to use the ICU library.

Take care to use special functions for Unicode string comparison; don't do it manually. Two Unicode strings can consist of different code points and still represent the same text.

#include <windows.h>
#include <string>

std::wstring ConvertToUnicode(const std::string & str)
{
    if (str.empty())
        return std::wstring();  // MultiByteToWideChar fails on zero-length input

    const UINT  codePage = CP_ACP;  // system default ANSI code page
    const DWORD flags    = 0;
    const int   length   = static_cast<int>(str.length());
    const int resultSize = MultiByteToWideChar
        ( codePage     // CodePage
        , flags        // dwFlags
        , str.c_str()  // lpMultiByteStr
        , length       // cbMultiByte
        , NULL         // lpWideCharStr
        , 0            // cchWideChar: 0 requests the required size
        );
    if (resultSize <= 0)
        return std::wstring();  // conversion failed; see GetLastError()

    std::wstring result(resultSize, L'\0');
    MultiByteToWideChar
        ( codePage     // CodePage
        , flags        // dwFlags
        , str.c_str()  // lpMultiByteStr
        , length       // cbMultiByte
        , &result[0]   // lpWideCharStr
        , resultSize   // cchWideChar
        );
    return result;
}
