
Split a UTF-8 encoded string on blank characters without knowing about UTF-8 encoding

Original post 2014-10-09 13:01:47 5 2 c++/ string/ encoding/ utf-8/ split

I would like to split a string at every blank character (' ', '\n', '\r', '\t', '\v', '\f'). The string is stored in UTF-8 encoding in a byte array (char*, or vector or string, for instance).

Can I just split the byte array at each splitting character? In other words, can I be sure that the byte values corresponding to these characters never occur inside a multi-byte character? From the UTF-8 spec, it seems that every byte of a multi-byte character has a value of 128 (0x80) or higher.

Thanks

2 answers

Yes, you can.

A multibyte sequence always consists of one lead byte (two MSBs equal to 11) followed by one or more continuation bytes (two MSBs equal to 10). The total length of the sequence (lead byte plus continuation bytes) equals the number of leading 1 bits in the lead byte, counted before the first 0 bit: if the lead byte is 110xxxxx, exactly one continuation byte should follow; if it is 11110xxx, there should be exactly three. Every byte of a multibyte sequence therefore has its top bit set (value 0x80 or higher), while the six blank characters you list are all below 0x80, so they can never appear inside one.
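To make the lead-byte rule concrete, here is a minimal sketch that classifies a single byte by its leading bits. The function name utf8_sequence_length is made up for this illustration, and the check looks only at the bit pattern: it does not reject lead bytes such as 0xC0 or 0xF5 that a strict validator would refuse.

```cpp
#include <cstddef>

// Hypothetical helper: returns the total length in bytes (1 to 4) of the
// UTF-8 sequence that starts with `lead`, or 0 if `lead` cannot start a
// sequence (a continuation byte 10xxxxxx, or five or more leading 1 bits).
std::size_t utf8_sequence_length(unsigned char lead) {
    if ((lead & 0x80) == 0x00) return 1; // 0xxxxxxx: plain ASCII
    if ((lead & 0xE0) == 0xC0) return 2; // 110xxxxx: one continuation byte
    if ((lead & 0xF0) == 0xE0) return 3; // 1110xxxx: two continuation bytes
    if ((lead & 0xF8) == 0xF0) return 4; // 11110xxx: three continuation bytes
    return 0;                            // not a valid lead byte
}
```

Only the first branch can ever match one of the six ASCII blanks, since all other branches require the top bit to be set.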

So if you come across short multibyte sequences or stray continuation bytes without a lead byte, your string is probably invalid anyway, and your split procedure would not damage it any further than it already is.

But there is something you might want to note: Unicode defines other “blank” characters outside the ASCII range, such as U+00A0 (no-break space) and U+3000 (ideographic space). You might want to treat them accordingly.
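If you do want to recognise some of those non-ASCII blanks without a full decoder, one option is to match their encoded byte patterns directly. The sketch below is only an illustration: the function name non_ascii_blank_length is hypothetical, and the set it covers (U+00A0, U+2000 to U+200A, U+2028, U+2029, U+3000) is a hand-picked subset, not the complete list of Unicode whitespace.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: if one of a few non-ASCII blanks starts at s[i],
// returns its encoded length in bytes; otherwise returns 0.
std::size_t non_ascii_blank_length(const std::string& s, std::size_t i) {
    // Reads byte k, or 0 when k is past the end (0 matches no pattern below).
    auto byte = [&](std::size_t k) -> unsigned char {
        return k < s.size() ? static_cast<unsigned char>(s[k]) : 0;
    };
    const unsigned char b0 = byte(i), b1 = byte(i + 1), b2 = byte(i + 2);

    if (b0 == 0xC2 && b1 == 0xA0) return 2;               // U+00A0
    if (b0 == 0xE2 && b1 == 0x80) {
        if (b2 >= 0x80 && b2 <= 0x8A) return 3;           // U+2000..U+200A
        if (b2 == 0xA8 || b2 == 0xA9) return 3;           // U+2028, U+2029
    }
    if (b0 == 0xE3 && b1 == 0x80 && b2 == 0x80) return 3; // U+3000
    return 0;
}
```

A splitter could call this at each position and skip the returned number of bytes whenever it is non-zero.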

If you limit yourself to the set of whitespace characters you mention, the answer is definitely "yes".
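For that restricted set, a byte-wise split could look roughly like the sketch below. The names is_ascii_blank and split_on_ascii_blanks are invented for this example, and it assumes that runs of blanks should collapse and empty tokens should be dropped; adjust if you need to keep empty fields.

```cpp
#include <string>
#include <vector>

// True for exactly the six blank characters listed in the question.
bool is_ascii_blank(char c) {
    return c == ' ' || c == '\n' || c == '\r' ||
           c == '\t' || c == '\v' || c == '\f';
}

// Splits a UTF-8 byte string on ASCII blanks without decoding it.
// Multibyte sequences pass through untouched because none of their bytes
// can compare equal to an ASCII blank.
std::vector<std::string> split_on_ascii_blanks(const std::string& bytes) {
    std::vector<std::string> tokens;
    std::string current;
    for (char c : bytes) {
        if (is_ascii_blank(c)) {
            if (!current.empty()) {      // close the token in progress
                tokens.push_back(current);
                current.clear();
            }
        } else {
            current.push_back(c);        // any other byte, ASCII or not
        }
    }
    if (!current.empty()) tokens.push_back(current);
    return tokens;
}
```

Because the comparison is done byte by byte, the same loop works on a char* buffer or a vector of bytes just as well as on a std::string.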

Of course, there is always an issue of checking whether your text is valid UTF-8 in the first place...
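As a rough idea of what such a check involves, here is a sketch of a validator. The name is_valid_utf8 is hypothetical and the code is an illustration written for this thread, not a replacement for a tested library. It rejects stray continuation bytes, truncated sequences, overlong encodings, surrogate code points and values above U+10FFFF.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: returns true if `s` is well-formed UTF-8.
bool is_valid_utf8(const std::string& s) {
    // Smallest code point that legitimately needs a sequence of each length.
    static const std::uint32_t min_cp[] = {0, 0, 0x80, 0x800, 0x10000};
    const std::size_t n = s.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char lead = static_cast<unsigned char>(s[i]);
        std::size_t len;
        std::uint32_t cp;
        if (lead < 0x80)                { len = 1; cp = lead; }
        else if ((lead & 0xE0) == 0xC0) { len = 2; cp = lead & 0x1F; }
        else if ((lead & 0xF0) == 0xE0) { len = 3; cp = lead & 0x0F; }
        else if ((lead & 0xF8) == 0xF0) { len = 4; cp = lead & 0x07; }
        else return false;               // stray continuation or bad lead byte

        if (len > n - i) return false;   // sequence runs past the end
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char c = static_cast<unsigned char>(s[i + k]);
            if ((c & 0xC0) != 0x80) return false; // not a continuation byte
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp < min_cp[len]) return false;              // overlong encoding
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // UTF-16 surrogate
        if (cp > 0x10FFFF) return false;                 // beyond Unicode range
        i += len;
    }
    return true;
}
```

Running a check like this before splitting tells you whether the tokens you get back can be trusted to contain whole characters.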


