
C: Most efficient way to determine how many bytes will be needed for a UTF-16 string from a UTF-8 string

I've seen some very clever code out there for converting between Unicode codepoints and UTF-8, so I was wondering if anybody has (or would enjoy devising) something similar for this problem.

  • Given a UTF-8 string, how many bytes are needed for the UTF-16 encoding of the same string?
  • Assume the UTF-8 string has already been validated. It has no BOM, no overlong sequences, no invalid sequences, and is null-terminated. It is not CESU-8.
  • Full UTF-16 with surrogates must be supported.

Specifically, I wonder if there are shortcuts to knowing when a surrogate pair will be needed without fully converting the UTF-8 sequence into a codepoint.

The best UTF-8 to codepoint code I've seen uses vectorizing techniques, so I wonder if that's also possible here.

Efficiency is always a speed vs size tradeoff. If speed is favored over size, then the most efficient way is just to guess based on the length of the source string.

There are 4 cases that need to be considered; simply take the worst case as the final buffer size:

  • U+0000-U+007F - encode to 1 byte in UTF-8 and 2 bytes per character in UTF-16. (1:2 = x2)
  • U+0080-U+07FF - encode to 2-byte UTF-8 sequences and 2 bytes per character in UTF-16. (2:2 = x1)
  • U+0800-U+FFFF - are stored as 3-byte UTF-8 sequences, but still fit in a single UTF-16 code unit. (3:2 = x0.67)
  • U+10000-U+10FFFF - are stored as 4-byte UTF-8 sequences, or as surrogate pairs in UTF-16. (4:4 = x1)

The worst-case expansion factor occurs when translating U+0000-U+007F from UTF-8 to UTF-16: the buffer, bytewise, merely has to be twice as large as the source string. Every other Unicode codepoint takes the same number of bytes, or fewer, when encoded as UTF-16 than as UTF-8.
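As a rough sketch of this worst-case sizing: the helper name `utf16_max_bytes` is made up for illustration, and reserving 2 extra bytes for a UTF-16 null terminator is an assumption about how the buffer will be used.

```c
#include <stddef.h>
#include <string.h>

/* Upper bound, in bytes, on the UTF-16 encoding of a valid,
 * null-terminated UTF-8 string. Every UTF-8 byte expands to at most
 * 2 bytes of UTF-16; one extra code unit is reserved for a terminator.
 * Hypothetical helper, not part of the original answer. */
static size_t utf16_max_bytes(const char *utf8)
{
    return (strlen(utf8) + 1) * 2;
}
```

This over-allocates for any non-ASCII text (a 3-byte UTF-8 sequence reserves 6 bytes but needs only 2), which is the size cost paid for not inspecting the bytes.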

Very simple: count the number of head bytes (every byte that is not a continuation byte), double-counting bytes F0 and up.

In code:

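#include <stddef.h>

/* Returns the length of the UTF-16 encoding in code units (not bytes). */
size_t count(const unsigned char *s)
{
    size_t l;
    /* +1 for every byte outside the continuation range 0x80-0xBF,
       +1 more for every 4-byte lead (0xF0 and up): a surrogate pair. */
    for (l=0; *s; s++) l+=(*s-0x80U>=0x40)+(*s>=0xf0);
    return l;
}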

Note: This function returns the length in UTF-16 code units. If you want the number of bytes needed, multiply by 2. If you're going to store a null terminator you'll also need to account for space for that (one extra code unit/two extra bytes).
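The note above could be folded into a small wrapper along these lines. The name `utf16_bytes_needed` is hypothetical, and the counting loop is repeated only so that the block compiles on its own.

```c
#include <stddef.h>

/* Same loop as count() above, repeated so this block is self-contained. */
static size_t count(const unsigned char *s)
{
    size_t l;
    for (l = 0; *s; s++) l += (*s - 0x80U >= 0x40) + (*s >= 0xf0);
    return l;
}

/* Hypothetical wrapper: bytes needed for the UTF-16 encoding of a valid,
 * null-terminated UTF-8 string, including a 2-byte null terminator. */
static size_t utf16_bytes_needed(const char *utf8)
{
    return (count((const unsigned char *)utf8) + 1) * 2;
}
```

For example, the 10-byte UTF-8 string holding "a", U+00E9, U+20AC and U+1F600 is 5 code units long, so the wrapper reports 12 bytes.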

It's not an algorithm, but if I understand correctly the rules are as follows:

  • every byte having an MSB of 0 adds 2 bytes (1 UTF-16 code unit)
    • that byte represents a single Unicode codepoint in the range U+0000 - U+007F
  • every byte having the MSBs 110 or 1110 adds 2 bytes (1 UTF-16 code unit)
    • these bytes start 2- and 3-byte sequences respectively, which represent Unicode codepoints in the range U+0080 - U+FFFF
  • every byte having the 4 MSBs set (i.e. starting with 1111) adds 4 bytes (2 UTF-16 code units)
    • these bytes start 4-byte sequences which cover "the rest" of the Unicode range, which is represented with a high and a low surrogate in UTF-16
  • every other byte (i.e. those starting with 10) can be skipped
    • these bytes are already counted with the others.
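One way to turn these rules into code is a 16-entry table indexed by the top four bits of each byte. This is only a sketch under the question's assumption of already-validated input; the function name `utf16_units_by_rule` is invented for illustration.

```c
#include <stddef.h>

/* Returns the number of UTF-16 code units needed for a valid,
 * null-terminated UTF-8 string. Multiply by 2 for bytes.
 * Hypothetical helper illustrating the rules listed above. */
static size_t utf16_units_by_rule(const unsigned char *s)
{
    /* Code units contributed by a byte, indexed by its top 4 bits. */
    static const unsigned char units[16] = {
        1, 1, 1, 1, 1, 1, 1, 1, /* 0xxxxxxx: single-byte codepoint  */
        0, 0, 0, 0,             /* 10xxxxxx: continuation, skipped  */
        1, 1,                   /* 110xxxxx: starts 2-byte sequence */
        1,                      /* 1110xxxx: starts 3-byte sequence */
        2                       /* 1111xxxx: starts 4-byte sequence */
    };
    size_t n = 0;
    for (; *s; s++)
        n += units[*s >> 4];
    return n;
}
```

The table makes each rule visible as one row, at the cost of a memory lookup per byte.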

I'm not a C expert, but this looks easily vectorizable.
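To illustrate why, here is a minimal word-at-a-time (SWAR) sketch that classifies 8 bytes per step using only bit operations. It is an assumption-laden illustration rather than a tuned implementation: the names `lane_sum` and `utf16_units_swar` are made up, the length is passed in (for example from `strlen`) instead of scanning for the terminator, and the input is assumed to be valid UTF-8.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Sums eight byte lanes that each hold 0 or 1 (result is 0..8). */
static size_t lane_sum(uint64_t x)
{
    return (size_t)((x * UINT64_C(0x0101010101010101)) >> 56);
}

/* UTF-16 code units needed for len bytes of valid UTF-8. */
static size_t utf16_units_swar(const unsigned char *s, size_t len)
{
    const uint64_t HI = UINT64_C(0x8080808080808080);
    size_t n = 0, i = 0;

    for (; i + 8 <= len; i += 8) {
        uint64_t w, cont, four;
        memcpy(&w, s + i, 8);                 /* alignment-safe load */
        /* Bit 7 of each lane is set where the byte is 10xxxxxx. */
        cont = w & ~(w << 1) & HI;
        /* Bit 7 of each lane is set where the byte is 1111xxxx. */
        four = w & (w << 1) & (w << 2) & (w << 3) & HI;
        /* One unit per non-continuation byte, one more per 4-byte lead. */
        n += 8 - lane_sum(cont >> 7) + lane_sum(four >> 7);
    }
    /* Remaining 0..7 bytes: same per-byte rule as the scalar loop. */
    for (; i < len; i++)
        n += (s[i] - 0x80U >= 0x40) + (s[i] >= 0xf0);
    return n;
}
```

Each byte is classified independently of its neighbours, so the result does not depend on byte order, and the same masks carry over to wider SIMD registers.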
