How can I convert a string like "\u94b1" to one real character in C++?
We know that in a string literal, "\u94b1" is converted to a single character, in this case the Chinese character '钱'. But if the string literally contains the six characters '\\', 'u', '9', '4', 'b', '1', how can I convert them to that character manually?
For example:
string s1;
string s2 = "\u94b1";
cin >> s1; //here I input \u94b1
cout << s1 << endl; //here output \u94b1
cout << s2 << endl; //and here output 钱
I want to convert s1 so that cout << s1 << endl; will also output 钱.
Any suggestions, please?
In fact, the conversion is a little more complicated.
string s2 = "\u94b1";
is in fact equivalent to the following (assuming the compiler encodes narrow string literals as UTF-8):
char cs2[] = { '\xe9', '\x92', '\xb1', 0 };
string s2 = cs2;
That means you are initializing it with the 3 bytes that make up the UTF-8 representation of 钱. You can examine s2.c_str() to verify this.
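One way to examine those bytes is to dump them as hex. This is only a minimal sketch; hex_bytes is a hypothetical helper name, not something from the answer or from any library.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper: render each byte of s as two lowercase hex digits,
// separated by single spaces.
std::string hex_bytes(const std::string& s) {
    std::string out;
    char buf[3];  // two hex digits plus the terminating null
    for (std::string::size_type i = 0; i < s.size(); ++i) {
        // Go through unsigned char so bytes >= 0x80 are not sign-extended.
        unsigned int byte = static_cast<unsigned char>(s[i]);
        std::snprintf(buf, sizeof buf, "%02x", byte);
        if (i != 0) out += ' ';
        out += buf;
    }
    return out;
}
```

If the compiler encodes narrow literals as UTF-8, hex_bytes(s2) for the s2 above gives "e9 92 b1", while hex_bytes(s1) shows the six ASCII bytes of the raw escape text.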
So to process the 6 raw characters '\\', 'u', '9', '4', 'b', '1', you must first extract the wchar_t from string s1 = "\\u94b1"; (which is what you get when you read it). This is easy: skip the first two characters and read the rest as hexadecimal:
unsigned int ui;
std::istringstream is(s1.c_str() + 2); // requires <sstream>; skips the backslash and the 'u'
is >> std::hex >> ui;
ui is now 0x94b1.
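The extraction above trusts that the input really has the expected shape. If that cannot be assumed, a stricter version could check the prefix and the four hex digits first. This is a sketch only; parse_u_escape is a hypothetical name, and returning -1 on failure is an arbitrary choice made for the illustration.

```cpp
#include <cctype>
#include <cstdlib>
#include <string>

// Hypothetical helper: if s is exactly a backslash, a 'u', and four hex
// digits, return the parsed value; otherwise return -1.
long parse_u_escape(const std::string& s) {
    if (s.size() != 6 || s[0] != '\\' || s[1] != 'u') return -1;
    for (std::string::size_type i = 2; i < 6; ++i) {
        // isxdigit needs an unsigned char value to be well defined.
        if (!std::isxdigit(static_cast<unsigned char>(s[i]))) return -1;
    }
    return std::strtol(s.c_str() + 2, nullptr, 16);
}
```

Checking the digits before calling strtol matters because strtol on its own would also accept leading whitespace, a sign, or a "0x" prefix.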
Now, provided you have a C++11-compliant system, you can convert it with std::codecvt_utf8 (declared in <codecvt>, and deprecated since C++17):
wchar_t wc = ui;
std::codecvt_utf8<wchar_t> conv;
const wchar_t *wnext;
char *next;
char cbuf[4] = {0}; // initialize the buffer to 0 to have a terminating null
std::mbstate_t state = std::mbstate_t(); // start from the initial conversion state
conv.out(state, &wc, &wc + 1, wnext, cbuf, cbuf+4, next);
cbuf now contains the 3 bytes representing 钱 in UTF-8 and a terminating null, so you can finally do:
string s3 = cbuf;
cout << s3 << endl;
You do this by writing code that checks whether the string contains a backslash, a letter u, and four hexadecimal digits, and converts them to a Unicode code point. Your std::string is then most likely expected to hold UTF-8, so you translate that code point into 1, 2, or 3 UTF-8 bytes.
For extra points, figure out how to enter code points outside the Basic Multilingual Plane.
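The translation step can be sketched by hand as below. The function names encode_utf8 and combine_surrogates are hypothetical, and the sketch assumes that code points outside the Basic Multilingual Plane arrive as a UTF-16 surrogate pair (two consecutive \uXXXX escapes), which is one common convention but not something the question specifies.

```cpp
#include <string>

// Encode one Unicode code point as UTF-8. Returns an empty string for
// values that cannot be encoded (surrogates, or anything above 0x10FFFF).
std::string encode_utf8(unsigned long cp) {
    std::string out;
    if (cp < 0x80) {                       // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {               // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        if (cp >= 0xD800 && cp <= 0xDFFF) return out;  // lone surrogate
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0x10FFFF) {           // 4 bytes: 11110xxx then three 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Combine a UTF-16 surrogate pair (hi in D800..DBFF, lo in DC00..DFFF)
// into the code point it represents. The caller must check the ranges.
unsigned long combine_surrogates(unsigned long hi, unsigned long lo) {
    return 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00);
}
```

For 0x94b1 this produces the bytes e9 92 b1, the same three bytes the compiler emits for the literal when it uses UTF-8.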
With utfcpp (a header-only library) you can do:
#include <utf8.h>    // utfcpp
#include <cstdint>
#include <cstdlib>   // std::strtoul
#include <iostream>
#include <iterator>  // std::back_inserter
#include <string>
std::string replace_utf8_escape_sequences(const std::string& str) {
    std::string result;
    std::string::size_type first = 0;
    std::string::size_type last = 0;
    while(true) {
        // Find an escape position
        last = str.find("\\u", last);
        if(last == std::string::npos) {
            result.append(str.begin() + first, str.end());
            break;
        }
        // Extract a 4 digit hexadecimal
        const char* hex = str.data() + last + 2;
        char* hex_end;
        std::uint_fast32_t code = std::strtoul(hex, &hex_end, 16);
        std::string::size_type hex_size = hex_end - hex;
        // Append the leading and converted string
        if(hex_size != 4) last = last + 2 + hex_size;
        else {
            result.append(str.begin() + first, str.begin() + last);
            try {
                utf8::utf16to8(&code, &code + 1, std::back_inserter(result));
            }
            catch(const utf8::exception&) {
                // Error Handling
                result.clear();
                break;
            }
            first = last = last + 2 + 4;
        }
    }
    return result;
}

int main()
{
    std::string source = "What is the meaning of '\\u94b1' '\\u94b1' '\\u94b1' '\\u94b1' ?";
    std::string target = replace_utf8_escape_sequences(source);
    std::cout << "Conversion from \"" << source << "\" to \"" << target << "\"\n";
}