std::string: optimal way to truncate UTF-8 at a safe place

I have a valid UTF-8 encoded string in a std::string, and I have a limit in bytes. I would like to truncate the string and add ... at MAX_SIZE - 3 - x, where x is the value that will prevent a UTF-8 character from being cut.

Is there a function that could determine x based on MAX_SIZE without the need to start from the beginning of the string?

If you have a location in a string, and you want to go backwards to find the start of a UTF-8 character (and therefore a valid place to cut), this is fairly easily done.

You start from the last byte in the sequence. If the top two bits of that byte are 10, then it is a continuation byte in the middle of a UTF-8 sequence, so keep backing up until the top two bits are not 10 (or until you reach the start).

The way UTF-8 works is that a byte can be one of three things, based on the upper bits of the byte. If the topmost bit is 0, then the byte is an ASCII character, and the remaining 7 bits are the Unicode codepoint value itself. If the top two bits are 10, then the 6 bits that follow are extra bits for a multi-byte sequence. But the beginning of a multi-byte sequence is coded with 11 in the top two bits.

So if the top two bits of a byte are not 10, then it's either an ASCII character or the start of a multi-byte sequence. Either way, it's a valid place to cut.
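The three cases above can be sketched as a small classifier. This is only an illustration: the names Utf8ByteKind, ClassifyUtf8Byte and IsSafeCutByte are made up for this example, and the code assumes the input is already valid UTF-8, as stated in the question.

```cpp
#include <string>

// Hypothetical names, not part of any library.
enum class Utf8ByteKind { Ascii, Lead, Continuation };

// Classify a byte by its upper bits. Assumes valid UTF-8 input, so any
// byte starting with 11 is treated as the start of a multi-byte sequence.
inline Utf8ByteKind ClassifyUtf8Byte(char c)
{
  const unsigned char byte = static_cast<unsigned char>(c);
  if ((byte & 0x80) == 0x00) return Utf8ByteKind::Ascii;        // 0xxxxxxx
  if ((byte & 0xC0) == 0x80) return Utf8ByteKind::Continuation; // 10xxxxxx
  return Utf8ByteKind::Lead;                                    // 11xxxxxx
}

// A string may be cut in front of any byte that is not a continuation byte.
inline bool IsSafeCutByte(char c)
{
  return ClassifyUtf8Byte(c) != Utf8ByteKind::Continuation;
}
```

For example, in the two-byte encoding of é (0xC3 0xA9), the first byte is a lead byte and a safe place to cut, while the second is a continuation byte and is not.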

Note however that, while this algorithm will break the string at codepoint boundaries, it ignores Unicode grapheme clusters. This means that combining characters can be cut away from the base characters that they combine with; accents can be lost from characters, for example. Doing proper grapheme cluster analysis would require having access to the Unicode table that says whether a codepoint is a combining character.
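To make the limitation concrete, here is a rough sketch that detects one narrow case: a cut position that sits directly in front of a codepoint from the Combining Diacritical Marks block (U+0300 to U+036F, encoded as 0xCC 0x80 through 0xCD 0xAF). The function name is hypothetical, and this covers only that single block; it is not a substitute for real grapheme cluster segmentation with Unicode data.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: true if a codepoint in U+0300..U+036F starts at
// str[pos]. Only this one block is checked; other combining characters
// are NOT detected.
inline bool StartsWithCombiningDiacritic(const std::string &str, std::size_t pos)
{
  assert(pos <= str.size());
  if (pos + 1 >= str.size())
    return false; // fewer than two bytes left, cannot be a 2-byte sequence
  const unsigned char b0 = static_cast<unsigned char>(str[pos]);
  const unsigned char b1 = static_cast<unsigned char>(str[pos + 1]);
  if (b0 == 0xCC) return b1 >= 0x80 && b1 <= 0xBF; // U+0300..U+033F
  if (b0 == 0xCD) return b1 >= 0x80 && b1 <= 0xAF; // U+0340..U+036F
  return false;
}
```

With the string "cafe" followed by U+0301 (bytes 0xCC 0x81), position 4 is a perfectly valid codepoint boundary, yet this check reports a combining accent there: cutting at that position would drop the accent from the final "e".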

But the result will at least be a valid UTF-8 string. So that's better than most people do ;)


The code would look something like this (in C++14):

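#include <cassert>
#include <cstddef>
#include <string>

auto FindCutPosition(const std::string &str, std::size_t max_size)
{
  //assert takes a single expression, so the message is attached with &&.
  assert(str.size() >= max_size && "Make sure stupidity hasn't happened.");
  assert(max_size >= 3 && "Make sure stupidity hasn't happened.");
  max_size -= 3; //Leave room for the "..."
  for(std::size_t pos = max_size; pos > 0; --pos)
  {
    unsigned char byte = static_cast<unsigned char>(str[pos]); //Perfectly valid
    if((byte & 0xC0) != 0x80) //Parentheses are required: != binds tighter than &
      return pos;
  }

  unsigned char byte = static_cast<unsigned char>(str[0]); //Perfectly valid
  if((byte & 0xC0) != 0x80)
    return std::size_t{0}; //Same type as pos, so the auto return type deduces consistently

  //If your first byte isn't even a valid UTF-8 starting point, then something terrible has happened.
  //bad_utf8_encoded_text stands for an exception type of your choosing.
  throw bad_utf8_encoded_text(...);
}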
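For completeness, here is a minimal sketch of how the cut position might be used to build the final truncated string. The function name TruncateUtf8WithEllipsis is hypothetical, and two policies are assumptions of this example rather than anything the answer specifies: strings that already fit are returned unchanged, and a limit below 3 bytes yields only dots. Invalid input is not diagnosed here; the string is assumed to be valid UTF-8.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical wrapper around the back-up loop described above.
inline std::string TruncateUtf8WithEllipsis(const std::string &str, std::size_t max_size)
{
  if (str.size() <= max_size)
    return str; // already fits, nothing to cut (assumed policy)
  if (max_size < 3)
    return std::string(max_size, '.'); // no room for text plus "..." (assumed policy)

  std::size_t pos = max_size - 3; // reserve 3 bytes for "..."
  assert(pos < str.size());       // holds because str.size() > max_size

  // Back up while the byte at pos is a continuation byte (10xxxxxx).
  while (pos > 0 && (static_cast<unsigned char>(str[pos]) & 0xC0) == 0x80)
    --pos;

  return str.substr(0, pos) + "...";
}
```

The result never exceeds max_size bytes, because the cut position only ever moves backwards from max_size - 3. At most three continuation bytes have to be skipped in valid UTF-8, so the loop does a small, bounded amount of work and never needs to scan from the beginning of the string.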
