C: Most efficient way to determine how many bytes will be needed for a UTF-16 string from a UTF-8 string
I've seen some very clever code out there for converting between Unicode codepoints and UTF-8, so I was wondering if anybody has (or would enjoy devising) this.
Specifically, I wonder if there are shortcuts to knowing when a surrogate pair will be needed without fully converting the UTF-8 sequence into a codepoint.
The best UTF-8 to codepoint code I've seen uses vectorizing techniques, so I wonder if that's also possible here.
Efficiency is always a speed vs. size tradeoff. If speed is favored over size, then the most efficient way is just to guess an upper bound based on the length of the source string.
There are four cases to consider, one for each UTF-8 sequence length (1, 2, 3, or 4 bytes); simply take the worst case as the final buffer size.
The worst-case expansion factor occurs when translating U+0000-U+007F from UTF-8 to UTF-16: the buffer, bytewise, merely has to be twice as large as the source string. Every other Unicode codepoint results in an equal or smaller bytewise allocation when encoded as UTF-16 than when encoded as UTF-8.
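As a rough illustration of the worst-case sizing described above, here is a minimal sketch. The function name `utf16_max_bytes`, the choice to include room for a two-byte terminator, and the overflow convention (returning 0) are all assumptions made for this example, not part of the original answer.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: upper bound, in bytes, on the UTF-16 encoding of
 * utf8_len bytes of valid UTF-8, plus one 2-byte terminator.
 *
 *   UTF-8 bytes per codepoint -> UTF-16 bytes per codepoint
 *     1 (U+0000-U+007F)       -> 2   (factor 2, the worst case)
 *     2 (U+0080-U+07FF)       -> 2
 *     3 (U+0800-U+FFFF)       -> 2
 *     4 (U+10000-U+10FFFF)    -> 4
 *
 * Returns 0 if the result would not fit in size_t (an assumed convention). */
size_t utf16_max_bytes(size_t utf8_len)
{
    if (utf8_len >= SIZE_MAX / 2)
        return 0;
    return (utf8_len + 1) * 2;
}
```

A caller could pass `strlen(src)` and allocate that many bytes; the buffer may be larger than strictly needed, which is the size cost of skipping the counting pass.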
Very simple: count the number of head bytes, double-counting bytes F0 and up.
In code:
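#include <stddef.h>

size_t count(const unsigned char *s)
{
    size_t l;
    /* Each head (non-continuation) byte adds one UTF-16 code unit;
       head bytes 0xF0 and up add a second one for the surrogate pair. */
    for (l = 0; *s; s++)
        l += (*s - 0x80U >= 0x40) + (*s >= 0xf0);
    return l;
}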
Note: This function returns the length in UTF-16 code units. If you want the number of bytes needed, multiply by 2. If you're going to store a null terminator, you'll also need to account for the space for that (one extra code unit/two extra bytes).
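To make the note concrete, the sketch below wraps the same counting rule and converts the result to a byte count that includes the terminator. The names `utf16_units` and `utf16_bytes_with_nul` are hypothetical, and the code assumes the input is valid, null-terminated UTF-8.

```c
#include <stddef.h>

/* Same rule as the answer's count(): one code unit per head byte,
 * plus one more for head bytes 0xF0 and up. Assumes valid UTF-8. */
static size_t utf16_units(const unsigned char *s)
{
    size_t n = 0;
    for (; *s; s++)
        n += (*s - 0x80U >= 0x40) + (*s >= 0xF0);
    return n;
}

/* Hypothetical wrapper: bytes needed for the UTF-16 output,
 * including one extra code unit for a null terminator. */
size_t utf16_bytes_with_nul(const char *utf8)
{
    return (utf16_units((const unsigned char *)utf8) + 1) * 2;
}
```

For example, a string holding only U+1F600 (four UTF-8 bytes) needs two code units plus the terminator, so six bytes.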
It's not an algorithm, but if I understand correctly the rules are as such:
- every byte starting with 0 adds 2 bytes (1 UTF-16 code unit)
- every byte starting with 110 or 1110 adds 2 bytes (1 UTF-16 code unit)
- every byte with the top 4 bits set (i.e. starting with 1111) adds 4 bytes (2 UTF-16 code units)
- every other byte (i.e. starting with 10) can be skipped
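The rules above translate fairly directly into bit-mask tests. The following is a minimal sketch, assuming valid UTF-8 and an explicit length rather than a null terminator; the function name `utf16_bytes_by_rules` is made up for this example.

```c
#include <stddef.h>

/* Hypothetical length-bounded variant of the rules listed above.
 * Returns the number of bytes the UTF-16 encoding needs (no terminator).
 * Assumes the input is valid UTF-8; invalid input is not detected. */
size_t utf16_bytes_by_rules(const unsigned char *s, size_t len)
{
    size_t bytes = 0;
    for (size_t i = 0; i < len; i++) {
        unsigned char b = s[i];
        if ((b & 0xC0) == 0x80)
            continue;               /* 10xxxxxx: continuation byte, skip */
        if ((b & 0xF0) == 0xF0)
            bytes += 4;             /* 1111xxxx: surrogate pair */
        else
            bytes += 2;             /* 0xxxxxxx, 110xxxxx, or 1110xxxx */
    }
    return bytes;
}
```

Only the first byte of each sequence is inspected, so no codepoint is ever assembled.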
I'm not a C expert, but this looks easily vectorizable.
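One way to explore that idea without any SIMD intrinsics is a SWAR ("SIMD within a register") loop that classifies eight bytes per step using ordinary 64-bit arithmetic. The sketch below is only an illustration under stated assumptions: the name `utf16_units_swar` is hypothetical, the input is assumed to be valid UTF-8 with a known length, and no claim is made about how it compares in speed to a compiler-vectorized scalar loop.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical SWAR sketch: counts UTF-16 code units for len bytes of
 * valid UTF-8, eight bytes per iteration. */
size_t utf16_units_swar(const unsigned char *s, size_t len)
{
    const uint64_t HI   = 0x8080808080808080ULL; /* bit 7 of every byte */
    const uint64_t ONES = 0x0101010101010101ULL; /* bit 0 of every byte */
    size_t units = 0, i = 0;

    for (; i + 8 <= len; i += 8) {
        uint64_t w;
        memcpy(&w, s + i, 8);                    /* alignment-safe load */

        /* Bit 7 set where the byte is 10xxxxxx (bit 7 set, bit 6 clear). */
        uint64_t cont = w & ~(w << 1) & HI;
        /* Bit 7 set where bits 7..4 are all set (byte is 0xF0 or above). */
        uint64_t four = w & (w << 1) & (w << 2) & (w << 3) & HI;

        /* Horizontal sum of the per-byte flags (at most 8, so no carry). */
        size_t n_cont = (size_t)(((cont >> 7) * ONES) >> 56);
        size_t n_four = (size_t)(((four >> 7) * ONES) >> 56);

        /* Head bytes = 8 - continuation bytes; 4-byte heads count twice. */
        units += 8 - n_cont + n_four;
    }
    for (; i < len; i++)                         /* scalar tail */
        units += ((s[i] & 0xC0) != 0x80) + (s[i] >= 0xF0);
    return units;
}
```

Because the loop only counts flags, the result does not depend on byte order, and the same masks map naturally onto wider vector registers.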