
string to UTF-8 conversion in C++

I have a string Test\xc2\xae, represented in hex as 0x54 0x65 0x73 0x74 0x5c 0x78 0x63 0x32 0x5c 0x78 0x61 0x65. The character sequence \xc2\xae in this string is simply the UTF-8 encoding of ® (registered trademark).

I want to write a C++ function that converts the four-character sequence \xc2 (in hex 0x5c 0x78 0x63 0x32) to the single byte 0xc2.

E.g., I want to write a C++ function that converts Test\xc2\xae [ 0x54 0x65 0x73 0x74 0x5c 0x78 0x63 0x32 0x5c 0x78 0x61 0x65 ] to Test® [ 0x54 0x65 0x73 0x74 0xc2 0xae ].
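To make the two byte sequences above concrete, here is a minimal sketch of a helper that prints a string's bytes in the same 0x.. notation. The name to_hex_dump is hypothetical and only serves this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <string>

// Hypothetical helper: render every byte of s as "0x.." separated by spaces.
std::string to_hex_dump(const std::string &s)
{
  std::string out;
  char buf[8];
  for (std::size_t i = 0; i < s.size(); ++i)
  {
    // go through unsigned char so bytes >= 0x80 are not sign-extended
    const auto byte = static_cast<unsigned>(static_cast<unsigned char>(s[i]));
    std::snprintf(buf, sizeof buf, "0x%02x", byte);
    if (i != 0)
    {
      out += ' ';
    }
    out += buf;
  }
  return out;
}
```

Calling to_hex_dump("Test\\xc2\\xae") in C++ source dumps the 12-byte escaped input, while to_hex_dump("Test\xc2\xae") dumps the 6-byte UTF-8 target, since the compiler itself resolves \xc2 and \xae in the second literal.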

As far as I understand your question, you are trying to convert each \x?? sequence (four chars), where ?? is a sequence of two hexadecimal digits, to a single char whose value is the one expressed in hexadecimal.

If you don't want to pull in a large library dedicated to this, maybe this trivial algorithm could do the trick.

/**
  g++ -std=c++17 -o prog_cpp prog_cpp.cpp \
      -pedantic -Wall -Wextra -Wconversion -Wno-sign-conversion \
      -g -O0 -UNDEBUG -fsanitize=address,undefined
**/

#include <iostream>
#include <string>
#include <cctype>

std::string
convert_backslash_x(const std::string &str)
{
  auto result=std::string{};
  for(auto start=std::string::size_type{0};;)
  {
    const auto pos=str.find("\\x", start);
    if((pos==str.npos)||  // not found
       (pos+4>str.size())) // too close to the end
    {
      // keep the rest of the string
      result.append(str, start);
      break;
    }
    // keep everything until this position
    result.append(str, start, pos-start);
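    // cast to unsigned char first: passing a negative char value
    // to the <cctype> functions is undefined behaviour
    const auto c1=std::tolower(static_cast<unsigned char>(str[pos+2]));
    const auto c2=std::tolower(static_cast<unsigned char>(str[pos+3]));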
    if(std::isxdigit(c1)&&std::isxdigit(c2))
    {
      // convert two hex digits to a char with this value
      const auto h1=std::isalpha(c1) ? 10+(c1-'a') : (c1-'0');
      const auto h2=std::isalpha(c2) ? 10+(c2-'a') : (c2-'0');
      result+=char(h1*16+h2);
      // go on after this \x?? sequence
      start=pos+4; 
    }
    else
    {
      // keep this incomplete \x sequence as is
      result+="\\x";
      // go on after this \x sequence
      start=pos+2;
    }
  }
  return result;
}

int
main()
{
  for(const auto &s: {"Test\\xc2\\xae",
                      "Test\\xc2\\xae Test\\xc2\\xae",
                      "Test\\xc2\\xa",
                      "Test\\x\\xc2\\xa"})
  {
    std::cout << '(' << s << ") --> (" << convert_backslash_x(s) << ")\n";
  }
  return 0;
}
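As a variation on the same idea, the two hex digits can be parsed with std::from_chars from <charconv> (C++17) instead of by hand. This is only a sketch under that assumption; the name unescape_hex_pairs is hypothetical, and malformed or truncated \x sequences are kept as is, as in the program above.

```cpp
#include <cassert>
#include <charconv>
#include <cstddef>
#include <string>
#include <system_error>

// Hypothetical alternative: replace each \x?? (two hex digits) with one byte.
std::string unescape_hex_pairs(const std::string &str)
{
  std::string result;
  result.reserve(str.size());
  std::size_t i = 0;
  while (i < str.size())
  {
    // a candidate needs four chars: backslash, 'x' and two digits
    if (str[i] == '\\' && i + 3 < str.size() && str[i + 1] == 'x')
    {
      const char *first = str.data() + i + 2;
      const char *last = first + 2;
      unsigned value = 0;
      const auto res = std::from_chars(first, last, value, 16);
      // accept only if both chars were consumed as hex digits
      if (res.ec == std::errc{} && res.ptr == last)
      {
        result += static_cast<char>(static_cast<unsigned char>(value));
        i += 4;
        continue;
      }
    }
    // not a complete \x?? sequence: copy the char unchanged
    result += str[i];
    ++i;
  }
  return result;
}
```

With base 16, std::from_chars accepts upper- and lower-case digits and no 0x prefix, so the explicit std::tolower and std::isxdigit calls are not needed here. The check res.ptr == last rejects input such as \x1g, where only one of the two characters is a hex digit.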


 