
std::string character encoding

std::string arrWords[10];
std::vector<std::string> hElemanlar;

...... ......

this->hElemanlar.push_back(std::string(1, this->arrWords[sayKelime][j]).c_str());

...... ......

What I am doing is this: every element of arrWords is a std::string. I take the characters of the n-th element of arrWords and push them into hElemanlar one by one.

Assuming arrWords[0] is "test", then:

this->hElemanlar.push_back("t");
this->hElemanlar.push_back("e");
this->hElemanlar.push_back("s");
this->hElemanlar.push_back("t");

My problem is that, although I have no encoding problems with arrWords, some UTF-8 characters are not printed or handled correctly in hElemanlar. How can I fix it?

If you know that arrWords[i] contains UTF-8 encoded text, then you probably need to split the strings into complete Unicode characters rather than single bytes: operator[] returns one char (one byte), while a non-ASCII character occupies two to four bytes in UTF-8.

As an aside, rather than saying:

this->hElemanlar.push_back(std::string(1, this->arrWords[sayKelime][j]).c_str());

(which constructs a temporary std::string, obtains its C-string representation, constructs another temporary std::string from that, and pushes it onto the vector), say:

this->hElemanlar.push_back(std::string(1, this->arrWords[sayKelime][j]));

Anyway, this will need to become something like:

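std::string str(1, this->arrWords[sayKelime][j]);
if (static_cast<unsigned char>(str[0]) >= 0xC0)
{
   // str[0] is the lead byte of a multi-byte sequence: append every
   // continuation byte (bit pattern 10xxxxxx) that follows it.
   while ((static_cast<unsigned char>(this->arrWords[sayKelime][j + 1]) & 0xC0) == 0x80)
   {
       str.push_back(this->arrWords[sayKelime][++j]);
   }
}
this->hElemanlar.push_back(str);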

Note that the above loop is safe: if j is the index of the last char in the string, [j+1] returns the nul terminator, which ends the loop (since C++11, operator[] at index size() is guaranteed to return a null character). You will need to consider how incrementing j interacts with the rest of your code, though.
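To make the interaction with j easier to see, the same idea can be written as a standalone function that owns its index. This is only a minimal sketch: splitUtf8 is a hypothetical helper name, not something from the question, and it assumes the input is valid UTF-8.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not part of the original code): splits UTF-8 text
// into one std::string per code point. Assumes the input is valid UTF-8.
std::vector<std::string> splitUtf8(const std::string& text)
{
    std::vector<std::string> out;
    std::size_t i = 0;
    while (i < text.size())
    {
        std::size_t len = 1;
        // Extend over continuation bytes, which have the bit pattern 10xxxxxx.
        while (i + len < text.size() &&
               (static_cast<unsigned char>(text[i + len]) & 0xC0) == 0x80)
        {
            ++len;
        }
        out.push_back(text.substr(i, len));
        i += len; // skip the whole sequence, not just one byte
    }
    return out;
}
```

With something like this, the member could be filled with this->hElemanlar = splitUtf8(this->arrWords[sayKelime]); and the outer loop over j would no longer be needed.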

You then need to consider whether you want hElemanlar to hold individual Unicode code points (which this code does), or a character together with all the combining characters that follow it. In the latter case, you would have to extend the code above to:

  • Parse the next code point.
  • Decide whether it is a combining character.
  • Push its UTF-8 sequence onto the string if so.
  • Repeat (a character can have multiple combining characters).
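The steps above could be sketched roughly as follows. All three function names are hypothetical, the input is assumed to be valid UTF-8, and isCombiningApprox only checks a handful of combining-mark blocks as a stand-in; a real implementation would consult the Unicode character database (for example through a library such as ICU).

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Decodes the code point starting at text[i]; stores its byte length in len.
// Assumes valid UTF-8 and performs no error checking.
char32_t decodeAt(const std::string& text, std::size_t i, std::size_t& len)
{
    unsigned char lead = static_cast<unsigned char>(text[i]);
    len = lead < 0x80 ? 1 : lead < 0xE0 ? 2 : lead < 0xF0 ? 3 : 4;
    // Keep only the payload bits of the lead byte (5, 4 or 3 bits).
    char32_t cp = (len == 1) ? lead : (lead & (0xFF >> (len + 1)));
    for (std::size_t k = 1; k < len && i + k < text.size(); ++k)
        cp = (cp << 6) | (static_cast<unsigned char>(text[i + k]) & 0x3F);
    return cp;
}

// ASSUMPTION: a deliberately incomplete approximation that only covers
// the main combining diacritical mark blocks.
bool isCombiningApprox(char32_t cp)
{
    return (cp >= 0x0300 && cp <= 0x036F) || (cp >= 0x1AB0 && cp <= 0x1AFF) ||
           (cp >= 0x1DC0 && cp <= 0x1DFF) || (cp >= 0x20D0 && cp <= 0x20FF) ||
           (cp >= 0xFE20 && cp <= 0xFE2F);
}

// Splits text into base characters, each followed by its combining marks.
std::vector<std::string> splitWithCombining(const std::string& text)
{
    std::vector<std::string> out;
    std::size_t i = 0;
    while (i < text.size())
    {
        std::size_t len = 0;
        char32_t cp = decodeAt(text, i, len);
        if (isCombiningApprox(cp) && !out.empty())
            out.back().append(text, i, len); // attach to the preceding character
        else
            out.push_back(text.substr(i, len));
        i += len;
    }
    return out;
}
```

Because each combining mark is appended to the last element, any number of marks after a base character ends up in the same string, which covers the "repeat" step.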
