
Is there an easy way to get a substring of a UTF-8 encoded string whose repr has length at most N in Python?

For example, I have a string, and I'm hoping to find an easy way to get a substring, encoded in UTF-8, such that the length of the substring's repr is <= N. Of course I could try a substring of N/3 characters and then grow it to N/3+1, N/3+2, ..., but is there an easier way?

word = u"this is a ship, and some other words".encode("utf-8")
# some way to get a substring
substring = func(word, N)
# assert len(repr(substring)) <= N

Thanks!

A possible approach:

  1. Take the first N-1 bytes of the repr of the whole string.
  2. Examine the last 3 bytes to see whether you broke an escape sequence, and cut off bytes if necessary.
  3. Append a quote, keeping in mind that it may be ' or ".
  4. Eval the repr back to a UTF-8 byte string.
  5. Examine the last few bytes to see whether you broke the string in the middle of a Unicode code point, and cut off bytes if necessary. You can tell leading bytes and continuation bytes apart by examining the bit pattern.
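
The steps above can be sketched roughly as follows. This is a minimal sketch, not a definitive implementation: it assumes Python 3, where the repr of a bytes object starts with a `b` prefix (so the shortest possible repr, `b''`, is 3 characters), and it assumes the input is valid UTF-8. The names `truncate_for_repr` and `trim_partial_utf8` are made up for illustration. Instead of inspecting the last 3 characters by hand, it simply drops trailing characters until the truncated repr parses again.

```python
import ast


def trim_partial_utf8(data: bytes) -> bytes:
    """Drop a trailing, incomplete UTF-8 sequence (assumes valid UTF-8 input)."""
    i = len(data)
    # Step back over trailing continuation bytes (bit pattern 10xxxxxx);
    # a valid sequence has at most 3 of them.
    while i > 0 and len(data) - i < 3 and (data[i - 1] & 0xC0) == 0x80:
        i -= 1
    if i == 0:
        return data  # empty input (invalid UTF-8 is not handled here)
    lead = data[i - 1]
    have = len(data) - (i - 1)  # lead byte plus the continuation bytes seen
    if lead < 0x80:
        need = 1  # ASCII
    elif lead >> 5 == 0b110:
        need = 2
    elif lead >> 4 == 0b1110:
        need = 3
    elif lead >> 3 == 0b11110:
        need = 4
    else:
        return data  # not a lead byte; leave the data untouched
    return data if have >= need else data[: i - 1]


def truncate_for_repr(data: bytes, n: int) -> bytes:
    """Return a prefix of `data` with len(repr(prefix)) <= n."""
    if n < 3:
        raise ValueError("n must be at least 3, the length of repr(b'')")
    full = repr(data)
    if len(full) <= n:
        return data
    quote = full[-1]      # step 3: the closing quote may be ' or "
    head = full[: n - 1]  # step 1: leave room for the closing quote
    # Step 2: a broken escape is "\", "\x" or "\xH", so at most 3 cuts.
    for _ in range(4):
        try:
            chunk = ast.literal_eval(head + quote)  # step 4
            break
        except (SyntaxError, ValueError):
            head = head[:-1]
    else:
        raise RuntimeError("could not repair the truncated repr")
    return trim_partial_utf8(chunk)  # step 5


word = u"this is a ship, and some other words".encode("utf-8")
substring = truncate_for_repr(word, 10)
assert len(repr(substring)) <= 10
```

For the string in the question, `truncate_for_repr(word, 10)` gives `b'this is'`, whose repr is exactly 10 characters long. Using `ast.literal_eval` rather than `eval` keeps the round trip limited to literals.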
