
Split a UTF-8 encoded string on blank characters without knowing about UTF-8 encoding

I would like to split a string at every blank character (' ', '\n', '\r', '\t', '\v', '\f'). The string is stored in UTF-8 encoding in a byte array (char*, or vector or string, for instance).

Can I just split the byte array at each splitting character? Put differently, can I be sure that the byte values corresponding to these characters never occur inside a multi-byte character? Looking at the UTF-8 spec, it seems every byte of a multi-byte character has a value of 128 or higher.

Thanks

Yes, you can.

Multibyte sequences necessarily include one lead byte (the two MSBs equal to 11) and one or more continuation bytes (two MSBs equal to 10). The total length of the multibyte sequence (lead byte + continuation bytes) is equal to the number of leading 1 bits in the lead byte, before the first 0 bit appears (e.g. if the lead byte is 110xxxxx, exactly one continuation byte should follow; if it is 11110xxx, there should be exactly three continuation bytes). Either way, every byte of a multibyte sequence has its top bit set (value 0x80 or above), whereas all six characters you list are ASCII bytes below 0x80, so none of them can appear inside a multibyte character.

So, if you find short multibyte sequences or stray continuation bytes without a lead byte, your string is probably invalid anyway, and your split procedure wouldn't damage it any further than it already is.
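To make the byte-level argument concrete, here is a minimal sketch of such a split, assuming the text sits in a std::string. The names is_ascii_blank and split_on_ascii_blanks are made up for illustration, and dropping empty tokens is a choice of mine, not something the question asks for.

```cpp
#include <string>
#include <vector>

// True for the six ASCII blank characters named in the question.
// All of them are below 0x80, so they never match a lead or continuation byte.
inline bool is_ascii_blank(unsigned char b) {
    return b == ' ' || b == '\n' || b == '\r' ||
           b == '\t' || b == '\v' || b == '\f';
}

// Splits a UTF-8 byte string on ASCII blanks without decoding it.
// Runs of blanks are collapsed: empty tokens are dropped (an assumption).
std::vector<std::string> split_on_ascii_blanks(const std::string& utf8) {
    std::vector<std::string> tokens;
    std::string current;
    for (unsigned char b : utf8) {
        if (is_ascii_blank(b)) {
            if (!current.empty()) {
                tokens.push_back(current);
                current.clear();
            }
        } else {
            // Multibyte sequences are copied through byte by byte, untouched.
            current.push_back(static_cast<char>(b));
        }
    }
    if (!current.empty()) tokens.push_back(current);
    return tokens;
}
```

Calling split_on_ascii_blanks("caf\xC3\xA9 au\tlait") should give three tokens, with the two bytes of "é" kept together inside the first one.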

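But there is something you might want to note: Unicode introduces other “blank” symbols in the upper, non-ASCII-compatible ranges, for example U+00A0 (no-break space) and U+3000 (ideographic space). These are encoded as multibyte sequences, so a byte-wise split on the ASCII blanks will not see them. You might want to treat them accordingly.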
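If you do want to treat those as separators too, you can still work on raw bytes by matching their UTF-8 encodings. The following is only a sketch: blank_length_at and split_on_blanks are hypothetical names, and the list covers the code points that, as far as I know, carry the Unicode White_Space property, so check it against the current Unicode data before relying on it.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Returns the byte length of the blank character that starts at s[i],
// or 0 if no blank starts there. Assumes s is valid UTF-8.
std::size_t blank_length_at(const std::string& s, std::size_t i) {
    // Out-of-range positions yield 0x100, which matches no byte value.
    auto byte = [&](std::size_t k) -> unsigned {
        return k < s.size() ? static_cast<unsigned char>(s[k]) : 0x100u;
    };
    const unsigned b0 = byte(i), b1 = byte(i + 1), b2 = byte(i + 2);

    if (b0 == 0x20 || (b0 >= 0x09 && b0 <= 0x0D)) return 1;  // ' ' \t \n \v \f \r
    if (b0 == 0xC2 && (b1 == 0x85 || b1 == 0xA0)) return 2;  // U+0085, U+00A0
    if (b0 == 0xE1 && b1 == 0x9A && b2 == 0x80) return 3;    // U+1680
    if (b0 == 0xE2 && b1 == 0x80 &&
        ((b2 >= 0x80 && b2 <= 0x8A) ||                       // U+2000..U+200A
         b2 == 0xA8 || b2 == 0xA9 || b2 == 0xAF)) return 3;  // U+2028, U+2029, U+202F
    if (b0 == 0xE2 && b1 == 0x81 && b2 == 0x9F) return 3;    // U+205F
    if (b0 == 0xE3 && b1 == 0x80 && b2 == 0x80) return 3;    // U+3000
    return 0;
}

// Splits on ASCII and Unicode blanks; empty tokens are dropped (an assumption).
std::vector<std::string> split_on_blanks(const std::string& utf8) {
    std::vector<std::string> tokens;
    std::size_t start = 0, i = 0;
    while (i < utf8.size()) {
        const std::size_t n = blank_length_at(utf8, i);
        if (n == 0) { ++i; continue; }
        if (i > start) tokens.push_back(utf8.substr(start, i - start));
        i += n;
        start = i;
    }
    if (start < utf8.size()) tokens.push_back(utf8.substr(start));
    return tokens;
}
```

Matching at every byte offset is safe here because each pattern begins with a lead byte (0xC2, 0xE1, 0xE2, 0xE3), and a lead byte can never occur as a continuation byte in valid UTF-8.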

If you limit yourself to the set of whitespace characters you mention, the answer is definitely "yes".

Of course, there is always an issue of checking whether your text is valid UTF-8 in the first place...
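For what it's worth, a validity check is not much code either. This is a rough sketch, not a vetted implementation: is_valid_utf8 is a name I made up, and it applies the usual rules (correct continuation bytes, no overlong forms, no surrogates, nothing above U+10FFFF).

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Returns true if s is well-formed UTF-8.
bool is_valid_utf8(const std::string& s) {
    const std::size_t n = s.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char b0 = static_cast<unsigned char>(s[i]);
        if (b0 < 0x80) { ++i; continue; }          // plain ASCII byte

        std::size_t len = 0;
        std::uint32_t cp = 0, min_cp = 0;
        if ((b0 & 0xE0) == 0xC0)      { len = 2; cp = b0 & 0x1F; min_cp = 0x80; }
        else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; min_cp = 0x800; }
        else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; min_cp = 0x10000; }
        else return false;                         // stray continuation or bad lead byte

        if (n - i < len) return false;             // sequence cut short
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char b = static_cast<unsigned char>(s[i + k]);
            if ((b & 0xC0) != 0x80) return false;  // not a continuation byte
            cp = (cp << 6) | (b & 0x3F);
        }
        if (cp < min_cp) return false;                   // overlong encoding
        if (cp > 0x10FFFF) return false;                 // beyond the Unicode range
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // UTF-16 surrogate
        i += len;
    }
    return true;
}
```

Running this once before splitting tells you whether the "no blank byte inside a multibyte character" guarantee actually applies to your input.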
