
Is there an easy way in Python to get a substring of a UTF-8 encoded string whose repr is at most N characters long?

For example, I have a string and I am looking for an easy way to get a substring that is encoded in UTF-8 and whose repr has length <= N. Of course I can try the first N/3 characters and then grow that to N/3+1, N/3+2, ..., but is there an easier way?

word = u"this is a ship, and some other words".encode("utf-8")
# func is a placeholder for the function I am looking for
substring = func(word, N)
# desired property: len(repr(substring)) <= N

Thanks!

A possible approach:

  1. Take the first N-1 bytes of the repr of the whole string.
  2. Examine the last 3 bytes to see whether you cut through an escape sequence, and cut off further bytes if necessary.
  3. Append the closing quote, keeping in mind that it may be ' or ".
  4. Eval the repr to get the UTF-8 bytes back.
  5. Examine the last few bytes to see whether you broke the string in the middle of a Unicode code point, and cut off further bytes if necessary. You can tell leading bytes from continuation bytes by examining the bit pattern.
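
As a rough illustration, here is a minimal Python 3 sketch. It does not follow steps 1 to 4 literally: instead of slicing and re-evaluating the repr, it uses the simpler brute-force search mentioned in the question (grow the prefix until its repr no longer fits) and then applies step 5 to drop a half-cut character. The names truncate_for_repr and trim_partial_utf8 are made up for this example, and the input is assumed to be valid UTF-8. Note that in Python 3 the repr of bytes carries a b prefix, so lengths are one higher than in Python 2.

```python
def trim_partial_utf8(data):
    """Drop an incomplete multi-byte sequence from the end of `data`.

    Hypothetical helper. Assumes `data` is a prefix of valid UTF-8,
    so only the tail can be broken.
    """
    # Step back over trailing continuation bytes (bit pattern 10xxxxxx).
    start = len(data)
    while start > 0 and (data[start - 1] & 0xC0) == 0x80:
        start -= 1
    if start == 0:
        return b""  # empty, or nothing but continuation bytes
    lead = data[start - 1]
    if lead < 0x80:
        expected = 1  # 0xxxxxxx: single-byte (ASCII) character
    elif lead >> 5 == 0b110:
        expected = 2  # 110xxxxx: leads a 2-byte sequence
    elif lead >> 4 == 0b1110:
        expected = 3  # 1110xxxx: leads a 3-byte sequence
    else:
        expected = 4  # 11110xxx: leads a 4-byte sequence
    actual = len(data) - (start - 1)
    # Keep the last character only if all of its bytes are present.
    return data if actual == expected else data[:start - 1]


def truncate_for_repr(encoded, limit):
    """Return a prefix of `encoded` (UTF-8 bytes) with len(repr(...)) <= limit.

    Hypothetical helper; a linear scan, not tuned for long inputs.
    """
    if len(repr(b"")) > limit:
        raise ValueError("limit is too small to hold even an empty repr")
    best = b""
    for end in range(1, len(encoded) + 1):
        candidate = encoded[:end]
        # Stop at the first prefix that overflows; appending bytes
        # does not make the repr shorter.
        if len(repr(candidate)) > limit:
            break
        best = candidate
    # The byte-wise scan may have stopped inside a multi-byte character.
    return trim_partial_utf8(best)


word = "this is a ship, and some other words".encode("utf-8")
substring = truncate_for_repr(word, 10)
assert len(repr(substring)) <= 10
print(repr(substring))  # b'this is'
```

The scan calls repr on every prefix, so it is quadratic in the worst case. For long strings, starting near N/4 bytes (the worst case is four repr characters per byte) or using a binary search over the prefix length would cut that down.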


 