Unable to extract Unicode symbols from a C++ std::string

Question

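I want to read a C++ std::string, pass it to a function that analyzes it, and extract both the Unicode symbols and the plain ASCII symbols from it.

I went through many tutorials online, and they all mention that standard C++ does not fully support Unicode. Many of them suggest using ICU C++.

Here is my C++ program for understanding the basics of the above. It reads a raw string, converts it to an ICU UnicodeString, and prints it: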

#include <iostream>
#include <string>
#include "unicode/unistr.h"

int main()
{
    std::string s="Hello☺";
    // at this point s contains a line of text
    // which may be ANSI or UTF-8 encoded

    // convert std::string to ICU's UnicodeString
    icu::UnicodeString ucs = icu::UnicodeString::fromUTF8(icu::StringPiece(s.c_str()));

    // convert UnicodeString to std::wstring
    std::wstring ws;
    for (int i = 0; i < ucs.length(); ++i)
      ws += static_cast<wchar_t>(ucs[i]);

    std::wcout << ws << std::endl;
}

Expected output:

Hello☺

Actual output:

Hello?

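Please tell me what I am doing wrong. Suggestions for an alternative or simpler approach are also welcome.

Thanks

Update 1 (older): The working code is as follows: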

#include <iostream>
#include <string>
#include <locale>
#include "unicode/unistr.h"

void f(const std::string & s)
{
  std::wcout << "Inside called function" << std::endl;
  constexpr char locale_name[] = "";
  setlocale( LC_ALL, locale_name );
  std::locale::global(std::locale(locale_name));
  std::ios_base::sync_with_stdio(false);
  std::wcin.imbue(std::locale());
  std::wcout.imbue(std::locale());

  // at this point s contains a line of text which may be ANSI or UTF-8 encoded

  // convert std::string to ICU's UnicodeString
  icu::UnicodeString ucs = icu::UnicodeString::fromUTF8(icu::StringPiece(s.c_str()));

  // convert UnicodeString to std::wstring
  std::wstring ws;
  for (int i = 0; i < ucs.length(); ++i)
    ws += static_cast<wchar_t>(ucs[i]);

  std::wcout << ws << std::endl;
}

int main()
{
    constexpr char locale_name[] = "";
    setlocale( LC_ALL, locale_name );
    std::locale::global(std::locale(locale_name));
    std::ios_base::sync_with_stdio(false);
    std::wcin.imbue(std::locale());
    std::wcout.imbue(std::locale());

    std::wcout << "Inside main function" << std::endl;

    std::string s=u8"hello☺";
    // at this point s contains a line of text which may be ANSI or UTF-8 encoded

    // convert std::string to ICU's UnicodeString
    icu::UnicodeString ucs = icu::UnicodeString::fromUTF8(icu::StringPiece(s.c_str()));

    // convert UnicodeString to std::wstring
    std::wstring ws;
    for (int i = 0; i < ucs.length(); ++i)
      ws += static_cast<wchar_t>(ucs[i]);

    std::wcout << ws << std::endl;
    std::wcout << "--------------------------------" << std::endl;
    f(s);
    return 0;
}

Now the expected output and the actual output are the same:

Inside main function
hello☺
--------------------------------
Inside called function
hello☺

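Update 2 (latest): The code in Update 1 does not work for symbols such as 😆, which lie outside the Basic Multilingual Plane and need two UTF-16 code units. The working code for all possible Unicode symbols is therefore as follows. Special thanks to @Botje for the solution. I wish I could give it more than one tick! :)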

#include <iostream>
#include <string>
#include <locale>
#include "unicode/unistr.h"
#include "unicode/ustream.h"

void f(const std::u32string & s)
{
  std::wcout << "INSIDE CALLED FUNCTION:" << std::endl;

  icu::UnicodeString ustr = icu::UnicodeString::fromUTF32(reinterpret_cast<const UChar32 *>(s.c_str()), s.size());
  std::cout << "Unicode string is: " << ustr << std::endl;

  std::cout << "Size of unicode string = " << ustr.countChar32() << std::endl;

  std::cout << "Individual characters of the string are:" << std::endl;
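  // char32At() takes a UTF-16 code unit offset, so advance by whole code points
  for(int32_t i=0; i < ustr.length(); i = ustr.moveIndex32(i, 1))
    std::cout << icu::UnicodeString(ustr.char32At(i)) << std::endl;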

  std::cout << "--------------------------------" << std::endl;
}

int main()
{
    std::cout << "--------------------------------" << std::endl;
    constexpr char locale_name[] = "";
    setlocale( LC_ALL, locale_name );
    std::locale::global(std::locale(locale_name));
    std::ios_base::sync_with_stdio(false);
    std::wcin.imbue(std::locale());
    std::wcout.imbue(std::locale());

    std::wcout << "INSIDE MAIN FUNCTION:" << std::endl;

    std::u32string s=U"hello☺😆";

    icu::UnicodeString ustr = icu::UnicodeString::fromUTF32(reinterpret_cast<const UChar32 *>(s.c_str()), s.size());
    std::cout << "Unicode string is: " << ustr << std::endl;

    std::cout << "Size of unicode string = " << ustr.countChar32() << std::endl;

    std::cout << "Individual characters of the string are:" << std::endl;
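    // char32At() takes a UTF-16 code unit offset, so advance by whole code points
    for(int32_t i=0; i < ustr.length(); i = ustr.moveIndex32(i, 1))
      std::cout << icu::UnicodeString(ustr.char32At(i)) << std::endl;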

    std::cout << "--------------------------------" << std::endl;
    f(s);
    return 0;
}

Now the expected output and the actual output are the same:

--------------------------------
INSIDE MAIN FUNCTION:
Unicode string is: hello☺😆
Size of unicode string = 7
Individual characters of the string are:
h
e
l
l
o
☺
😆
--------------------------------
INSIDE CALLED FUNCTION:
Unicode string is: hello☺😆
Size of unicode string = 7
Individual characters of the string are:
h
e
l
l
o
☺
😆
--------------------------------
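
Why the per-code-unit copy in Update 1 breaks for 😆 can be sketched without ICU. The helper below, encode_utf16, is a hypothetical name used only for illustration (it is not part of ICU or the standard library) and assumes its argument is a valid Unicode scalar value. It shows that U+263A fits in one UTF-16 code unit while U+1F606 needs a surrogate pair, so casting each code unit to wchar_t separately splits the emoji in two.

```cpp
#include <string>

// Encode one Unicode code point as UTF-16 code units.
// Assumption: cp is a valid scalar value (<= 0x10FFFF, not a surrogate).
std::u16string encode_utf16(char32_t cp)
{
    std::u16string out;
    if (cp < 0x10000) {
        // Basic Multilingual Plane: one code unit is enough.
        out += static_cast<char16_t>(cp);
    } else {
        // Supplementary planes: split into a lead and a trail surrogate.
        const char32_t v = cp - 0x10000;
        out += static_cast<char16_t>(0xD800 + (v >> 10));
        out += static_cast<char16_t>(0xDC00 + (v & 0x3FF));
    }
    return out;
}
```

Calling encode_utf16(0x263A) yields one code unit and encode_utf16(0x1F606) yields two, which is why length() and countChar32() differ for strings containing such symbols.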

Answer 1

There are a number of stumbling blocks to getting this right:

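First, your file (and the smiley in it) should be encoded as UTF-8. The smiley should consist of the literal bytes 0xE2 0x98 0xBA.
You should mark the string as containing UTF-8 data by using the u8 prefix: u8"Hello☺" (note that from C++20 on, u8 literals have type const char8_t[] and no longer initialize a std::string directly).
Next, the documentation of icu::UnicodeString states that it stores Unicode as UTF-16. In this case you are lucky, because U+263A fits in a single UTF-16 code unit. Other emoji might not! You should either convert to UTF-32, or be very careful and use the char32At function.
Finally, the encoding used by wcout should be configured with imbue to match the encoding your environment expects. See the answers to this question.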
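
As a simpler alternative for the extraction step, the first and third points can be sketched with the standard library alone: decode the UTF-8 bytes of the std::string into UTF-32 code points, so that every symbol occupies exactly one element. The function decode_utf8 below is a hypothetical helper written for illustration, not an ICU or standard-library API, and it is deliberately minimal: it rejects truncated or malformed sequences but does not check for overlong encodings or encoded surrogates.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Decode a UTF-8 encoded std::string into UTF-32 code points.
// Throws std::runtime_error on a bad lead byte, a bad continuation byte,
// or a sequence that is cut off at the end of the string.
std::u32string decode_utf8(const std::string& s)
{
    std::u32string out;
    for (std::size_t i = 0; i < s.size();) {
        const unsigned char lead = static_cast<unsigned char>(s[i]);
        char32_t cp;
        std::size_t len;
        // The lead byte tells how many bytes the sequence has.
        if (lead < 0x80)                { cp = lead;        len = 1; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; len = 2; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; len = 3; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; len = 4; }
        else throw std::runtime_error("invalid UTF-8 lead byte");

        if (i + len > s.size())
            throw std::runtime_error("truncated UTF-8 sequence");

        // Each continuation byte contributes its low six bits.
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char c = static_cast<unsigned char>(s[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::runtime_error("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }
        out += cp;
        i += len;
    }
    return out;
}
```

With this, the bytes 0xE2 0x98 0xBA decode to the single code point U+263A, and elements below 0x80 in the result are the plain ASCII symbols. For production use, ICU's own conversion and iteration functions remain the more robust choice.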
