简体   繁体   中英

Reverse string with non-ASCII characters

I want to change the order in the string with special characters like this:

ZAŻÓŁĆ GĘŚLĄ JAŹŃ

to

ŃŹAJ ĄŁŚĘG ĆŁÓŻAZ

I try to use std::reverse

std::string text("ZAŻÓŁĆ GĘŚLĄ JAŹŃ!");
std::cout << text << std::endl;
std::reverse(text.rbegin(), text.rend());
std::cout << text << std::endl;

but the output show me that:

ZAŻÓŁĆ GĘŚLĄ JAŹŃ!

!\\203Ź\\305AJ \\204\\304L\\232Ř\\304G \\206āœû\\305AZ <- reversed string

So i try do this "manually" :

std::string text1("ZAŻÓŁĆ GĘŚLĄ JAŹŃ!");
std::cout << text1 << std::endl;
int count = (int) floorf(text1.size() /2.f);
std::cout << count  << "  " << text1.size() << std::endl;

unsigned int maxIndex = text1.size() - 1;
for (int i = 0; i < count ; i++)
{
    char tmp = text1[i];
    text1[i] = text1[maxIndex];
    text1[maxIndex] = tmp;
    maxIndex--;
}
std::cout << text1 << std::endl;

But in this case I have a problem in text1.size() because every special character are counted twice:

ZAŻÓŁĆ GĘŚLĄ JAŹŃ!

13 27 <- second number is text1.size()

!\\203Ź\\305AJ \\204\\304L\\232Ř\\304G \\206āœû\\305AZ

How is the proper way to reverse a string with special characters?

Your code really does correctly reverse bytes in your string, there's nothing wrong here. The problem, however, is that your compiler stores your literal string "ZAŻÓŁĆ GĘŚLĄ JAŹŃ!" in UTF-8 encoding.

And UTF-8 stores all characters except those that match ASCII as variable-length sequences of bytes. This means that one char (one byte) is no longer one character, so reversing char 's isn't now the same as reversing characters.

To achieve your goal you have at least two options:

  1. Use some utf-8 library that will let you iterate characters instead of bytes. One example is http://utfcpp.sourceforge.net/
  2. Somehow (and that depends a lot on the compiler and OS you are using) switch to utf-32 encoding that has constant character length and have good old constant-character-size strings without all this crazy variable-character-size troubles.

UPD: A nice link for you: http://www.joelonsoftware.com/articles/Unicode.html

You might code a reverseUt8 function by yourself:

std::string getMultiByteReversed(char ch1, char ch2)
{  
   if (ch == '\xc3') // most utf8 characters
      return std::string(ch1)+ std::string(ch2);
   } else {
      return std::string(ch1);
   }
}

std::string reverseMultiByteString(const std::string &s)
{
    std::string result;
    for (std::string::reverse_iterator it = s.rbegin(); it != s.rend(); ++it) {
       std::string reversed;
       if ( (it+1) != rbegin() && (reversed = getMultiByteReversed(*it, *it+1) ) {
          result += reversed;
          ++it;
       } else {
          result += *it;
       }
  }
  return result;
}

You can look up the utf8 codes at: http://www.utf8-chartable.de/

There are a couple of issues here. The answer is complex and can depend on exactly what you're trying to do.

First is that (as other answers have stated) if your string is UTF-8 encoded, one Unicode code point may consist of multiple bytes. If you just reverse the bytes, you'll break the UTF-8 encoding. The simplest (though not necessarily the best) fix for this is to convert the string to UTF-32 and reverse the 32-bit code points rather than bytes.

The next problem is that a single grapheme might consist of multiple Unicode code points. For example, a "é" might be encoded as the two code points U+0065 followed by U+0301. If you reverse the order of these, that will break it as the combining character U+301 will now be associate with a different base character. So "Pokémon" reversed this way would become "noḿekoP" with the accent over the "m" instead of the "e".

Now you might think that you can get around this problem by normalizing the string into a composed form first. That has its own problems, however, because not every grapheme can be represented by a single code point. For example, the Canadian flag emoji (🇨🇦) is represented by the code point U+1F1E8 followed by the code point U+1F1E6. There is no single code point for it. If you reverse its code points, you get the flag for Ascension Island (🇦🇨) instead.

Then you have languages where characters change form based on context, and I don't yet know much about dealing with those.

It may be closer to what you want to reverse grapheme clusters . See UAX29: Unicode text segmentation .

have you tried swapping characters one by one. For example, if the string length is odd, swap the first character with the last, second with the second last, till the middle character is left. If the string lengt is even, swap 1st with last, 2nd with 2nd last, till both the middle characters are swapped. In that way, the string will be reversed.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM