
Reverse string with non-ASCII characters

I want to reverse the order of the characters in a string that contains special characters, like this:

ZAŻÓŁĆ GĘŚLĄ JAŹŃ

to

ŃŹAJ ĄLŚĘG ĆŁÓŻAZ

I tried to use std::reverse:

std::string text("ZAŻÓŁĆ GĘŚLĄ JAŹŃ!");
std::cout << text << std::endl;
std::reverse(text.rbegin(), text.rend());
std::cout << text << std::endl;

but the output shows this:

ZAŻÓŁĆ GĘŚLĄ JAŹŃ!

!\\203Ź\\305AJ \\204\\304L\\232Ř\\304G \\206āœû\\305AZ <- reversed string

So I tried to do it "manually":

std::string text1("ZAŻÓŁĆ GĘŚLĄ JAŹŃ!");
std::cout << text1 << std::endl;
int count = (int) floorf(text1.size() /2.f);
std::cout << count  << "  " << text1.size() << std::endl;

unsigned int maxIndex = text1.size() - 1;
for (int i = 0; i < count ; i++)
{
    char tmp = text1[i];
    text1[i] = text1[maxIndex];
    text1[maxIndex] = tmp;
    maxIndex--;
}
std::cout << text1 << std::endl;

But in this case I have a problem with text1.size(), because every special character is counted twice:

ZAŻÓŁĆ GĘŚLĄ JAŹŃ!

13 27 <- second number is text1.size()

!\\203Ź\\305AJ \\204\\304L\\232Ř\\304G \\206āœû\\305AZ

What is the proper way to reverse a string with special characters?

Your code really does reverse the bytes in your string correctly; there's nothing wrong with it in that sense. The problem, however, is that your compiler stores your string literal "ZAŻÓŁĆ GĘŚLĄ JAŹŃ!" in UTF-8 encoding.

UTF-8 stores every character outside the ASCII range as a variable-length sequence of bytes. This means that one char (one byte) is no longer one character, so reversing chars is no longer the same as reversing characters.
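To make the byte/character mismatch concrete, here is a minimal sketch that counts code points rather than bytes. The helper name countCodePoints is made up for this illustration, and it assumes the input is valid UTF-8. The sample string is spelled with explicit byte escapes so the result does not depend on the source file encoding.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts Unicode code points in a string that is
// assumed to hold valid UTF-8. Continuation bytes have the bit pattern
// 10xxxxxx, so every byte that does NOT match it starts a new code point.
std::size_t countCodePoints(const std::string &s)
{
    std::size_t count = 0;
    for (unsigned char byte : s) {
        if ((byte & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}

// "ZAŻ" written out byte by byte: 'Z', 'A', then Ż as the two bytes 0xC5 0xBB.
const std::string sample = "ZA\xC5\xBB";
```

Here sample.size() is 4 while countCodePoints(sample) is 3. By the same count, the 27-byte string from the question holds 18 code points, 9 of which take two bytes each.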

To achieve your goal you have at least two options:

  1. Use a UTF-8 library that lets you iterate over characters instead of bytes. One example is http://utfcpp.sourceforge.net/
  2. Somehow (and that depends a lot on the compiler and OS you are using) switch to UTF-32, which uses a fixed number of bytes per character, and get good old constant-character-size strings without all these variable-character-size troubles.
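As a rough sketch of the second option, the string can be decoded to UTF-32, reversed, and encoded back to UTF-8. The function names decodeUtf8, encodeUtf8 and reverseByCodePoint are invented for this example, the decoder assumes valid UTF-8 and does no error checking, and a library such as the one linked above would normally be a better choice than hand-written conversion.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

// Decodes UTF-8 into UTF-32 code points. Assumes the input is valid UTF-8;
// no validation is attempted in this sketch.
std::u32string decodeUtf8(const std::string &in)
{
    std::u32string out;
    std::size_t i = 0;
    while (i < in.size()) {
        unsigned char lead = static_cast<unsigned char>(in[i]);
        char32_t cp;
        std::size_t extra; // number of continuation bytes that follow
        if (lead < 0x80)      { cp = lead;        extra = 0; }
        else if (lead < 0xE0) { cp = lead & 0x1F; extra = 1; }
        else if (lead < 0xF0) { cp = lead & 0x0F; extra = 2; }
        else                  { cp = lead & 0x07; extra = 3; }
        ++i;
        // Each continuation byte contributes its low six bits.
        for (; extra > 0 && i < in.size(); --extra, ++i) {
            cp = (cp << 6) | (static_cast<unsigned char>(in[i]) & 0x3F);
        }
        out.push_back(cp);
    }
    return out;
}

// Encodes UTF-32 code points back into UTF-8.
std::string encodeUtf8(const std::u32string &in)
{
    std::string out;
    for (char32_t cp : in) {
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}

// Reverses a UTF-8 string code point by code point via UTF-32.
std::string reverseByCodePoint(const std::string &utf8)
{
    std::u32string codePoints = decodeUtf8(utf8);
    std::reverse(codePoints.begin(), codePoints.end());
    return encodeUtf8(codePoints);
}
```

Calling reverseByCodePoint on the UTF-8 bytes of "ZAŻÓŁĆ" should give the UTF-8 bytes of "ĆŁÓŻAZ". Note that this reverses code points only, which is enough for the precomposed Polish letters in the question but not for every kind of text.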

UPD: A nice link for you: http://www.joelonsoftware.com/articles/Unicode.html

You could write a UTF-8-aware reverse function yourself:

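#include <string>

// Checking for a single lead byte such as '\xc3' would only cover
// U+00C0..U+00FF; Polish letters such as Ł (0xC5 0x81) use other lead bytes.
// Detecting continuation bytes (bit pattern 10xxxxxx) handles every length.
bool isContinuationByte(char ch)
{
    return (static_cast<unsigned char>(ch) & 0xC0) == 0x80;
}

std::string reverseMultiByteString(const std::string &s)
{
    std::string result;
    std::string sequence; // bytes of the current character, in original order
    for (std::string::const_reverse_iterator it = s.rbegin(); it != s.rend(); ++it) {
        sequence.insert(sequence.begin(), *it);
        if (!isContinuationByte(*it)) {
            // Reached the lead byte (or an ASCII byte): the character is complete.
            result += sequence;
            sequence.clear();
        }
    }
    result += sequence; // leftover bytes if the input began with continuation bytes
    return result;
}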

You can look up the UTF-8 codes at: http://www.utf8-chartable.de/

There are a couple of issues here. The answer is complex and can depend on exactly what you're trying to do.

First (as other answers have stated), if your string is UTF-8 encoded, one Unicode code point may consist of multiple bytes. If you just reverse the bytes, you'll break the UTF-8 encoding. The simplest (though not necessarily the best) fix for this is to convert the string to UTF-32 and reverse the 32-bit code points rather than the bytes.

The next problem is that a single grapheme might consist of multiple Unicode code points. For example, an "é" might be encoded as the two code points U+0065 followed by U+0301. If you reverse the order of these, you break it, because the combining character U+0301 will now be associated with a different base character. So "Pokémon" reversed this way would become "noḿekoP", with the accent over the "m" instead of the "e".
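A minimal sketch of keeping a base character and its accents together while reversing might look like the following. It is a deliberate simplification: the names isCombiningMark and reverseKeepingMarks are invented here, the input is assumed to be UTF-32 already, and only the Combining Diacritical Marks block (U+0300..U+036F) is recognized, which is far from the full set of combining characters.

```cpp
#include <cstddef>
#include <string>

// Simplifying assumption: only code points in the Combining Diacritical
// Marks block (U+0300..U+036F) are treated as combining characters.
bool isCombiningMark(char32_t cp)
{
    return cp >= 0x0300 && cp <= 0x036F;
}

// Reverses a UTF-32 string while keeping each base character together with
// the combining marks that follow it.
std::u32string reverseKeepingMarks(const std::u32string &in)
{
    std::u32string out;
    out.reserve(in.size());
    std::size_t end = in.size();
    while (end > 0) {
        // Step back over trailing combining marks to find their base character.
        std::size_t start = end - 1;
        while (start > 0 && isCombiningMark(in[start])) {
            --start;
        }
        // Copy the base character plus its marks in their original order.
        out.append(in, start, end - start);
        end = start;
    }
    return out;
}
```

With the decomposed spelling U"Poke\u0301mon", this keeps U+0301 directly after the "e" in the reversed result, whereas a plain code point reversal would move it after the "m".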

Now you might think that you can get around this problem by normalizing the string into a composed form first. That has its own problems, however, because not every grapheme can be represented by a single code point. For example, the Canadian flag emoji (🇨🇦) is represented by the code point U+1F1E8 followed by the code point U+1F1E6. There is no single code point for it. If you reverse its code points, you get the flag for Ascension Island (🇦🇨) instead.
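The flag case can be sketched in the same spirit. The code below reverses the code points and then puts each pair of regional indicator symbols back into its original order. The names isRegionalIndicator and reverseKeepingFlags are made up for illustration, and the sketch assumes that regional indicators only ever appear in complete pairs; it handles nothing else that a real grapheme segmenter would.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <utility>

// Regional indicator symbols occupy U+1F1E6..U+1F1FF; a flag emoji is
// encoded as a pair of them.
bool isRegionalIndicator(char32_t cp)
{
    return cp >= 0x1F1E6 && cp <= 0x1F1FF;
}

// Reverses code points, then swaps each adjacent pair of regional indicators
// back into its original order. Assumes indicators only occur in complete
// pairs; an odd-length run would be paired incorrectly.
std::u32string reverseKeepingFlags(std::u32string text)
{
    std::reverse(text.begin(), text.end());
    for (std::size_t i = 0; i + 1 < text.size(); ++i) {
        if (isRegionalIndicator(text[i]) && isRegionalIndicator(text[i + 1])) {
            std::swap(text[i], text[i + 1]);
            ++i; // skip the second half of the pair just restored
        }
    }
    return text;
}
```

For an input such as "a", U+1F1E8, U+1F1E6, "b", the result is "b", U+1F1E8, U+1F1E6, "a", so the Canadian flag survives the reversal.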

Then there are languages where characters change form based on context, and I don't yet know much about dealing with those.

Reversing grapheme clusters may be closer to what you want. See UAX #29: Unicode Text Segmentation.

Have you tried swapping characters one by one? If the string length is odd, swap the first character with the last, the second with the second-to-last, and so on until only the middle character is left. If the string length is even, do the same until both middle characters have been swapped. That way, the string will be reversed.
