Looking for the description of the algorithm to convert UTF8 to UTF16

I have 3 bytes representing a Unicode character encoded in UTF-8. For example, I have E2 82 AC (UTF-8), which represents the Unicode character € (U+20AC). Is there an algorithm to make this conversion? I know there is the Windows API MultiByteToWideChar, but I would like to know whether there is a simple mathematical relation between E2 82 AC and U+20AC. In other words, is the mapping UTF-8 -> UTF-16 a simple mathematical function, or is it a hardcoded map?

Converting a valid UTF-8 byte sequence directly to UTF-16 is doable with a little mathematical know-how.

Checking that a UTF-8 byte sequence is structurally well formed is simple: check that the first byte matches one of the patterns below, and that (byte and $C0) = $80 is true for each subsequent byte in the sequence. (A strict validator would additionally reject overlong encodings, encoded surrogates U+D800..U+DFFF, and codepoints above U+10FFFF.)

The first byte in a UTF-8 sequence tells you how many bytes are in the sequence:

(byte1 and $80) = $00: 1 byte
(byte1 and $E0) = $C0: 2 bytes
(byte1 and $F0) = $E0: 3 bytes
(byte1 and $F8) = $F0: 4 bytes
anything else: error
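
As an illustration, the lead-byte table and the continuation-byte test can be sketched in Python. The function names `utf8_sequence_length` and `is_well_formed` are hypothetical, chosen only for this example, and the check is structural only: it does not reject overlong encodings, encoded surrogates, or codepoints above U+10FFFF.

```python
def utf8_sequence_length(byte1: int) -> int:
    """Return the sequence length implied by the lead byte, or raise ValueError."""
    if (byte1 & 0x80) == 0x00:
        return 1
    if (byte1 & 0xE0) == 0xC0:
        return 2
    if (byte1 & 0xF0) == 0xE0:
        return 3
    if (byte1 & 0xF8) == 0xF0:
        return 4
    raise ValueError(f"invalid UTF-8 lead byte: 0x{byte1:02X}")


def is_well_formed(seq: bytes) -> bool:
    """Structural check: lead-byte pattern plus 10xxxxxx continuation bytes."""
    if not seq:
        return False
    try:
        expected = utf8_sequence_length(seq[0])
    except ValueError:
        return False
    if len(seq) != expected:
        return False
    # Every byte after the first must match (byte and $C0) = $80.
    return all((b & 0xC0) == 0x80 for b in seq[1:])
```

Here `is_well_formed` expects exactly one encoded character; a full decoder would instead walk a longer buffer, consuming one sequence at a time.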

There are very simple formulas for converting UTF-8 1-byte, 2-byte, and 3-byte sequences to UTF-16, as they all represent Unicode codepoints below U+10000, and thus can be represented as-is in UTF-16 using just one 16-bit codeunit, no surrogates needed, e.g.:

1 byte:

UTF16 = UInt16(byte1 and $7F)

2 bytes:

UTF16 = (UInt16(byte1 and $1F) shl 6)
        or UInt16(byte2 and $3F)

3 bytes:

UTF16 = (UInt16(byte1 and $0F) shl 12)
        or (UInt16(byte2 and $3F) shl 6)
        or UInt16(byte3 and $3F)
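
The three formulas above translate almost directly into Python. This is a minimal sketch that assumes the input has already been checked to be a valid 1-, 2-, or 3-byte sequence; the name `utf8_to_utf16_bmp` is made up for this example.

```python
def utf8_to_utf16_bmp(seq: bytes) -> int:
    """Convert a valid 1-, 2-, or 3-byte UTF-8 sequence to one UTF-16 code unit."""
    if len(seq) == 1:
        # 0xxxxxxx: 7 payload bits
        return seq[0] & 0x7F
    if len(seq) == 2:
        # 110xxxxx 10xxxxxx: 5 + 6 payload bits
        return ((seq[0] & 0x1F) << 6) | (seq[1] & 0x3F)
    if len(seq) == 3:
        # 1110xxxx 10xxxxxx 10xxxxxx: 4 + 6 + 6 payload bits
        return ((seq[0] & 0x0F) << 12) | ((seq[1] & 0x3F) << 6) | (seq[2] & 0x3F)
    raise ValueError("expected a 1-, 2-, or 3-byte sequence")
```

For instance, `hex(utf8_to_utf16_bmp(bytes([0xC3, 0xA9])))` gives `'0xe9'`, the codepoint of é.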

Converting a UTF-8 4-byte sequence to UTF-16 is slightly more involved, since all Unicode codepoints it represents will need UTF-16 surrogates, which requires an additional step to calculate, e.g.:

4 bytes:

CP = (UInt32(byte1 and $07) shl 18)
     or (UInt32(byte2 and $3F) shl 12)
     or (UInt32(byte3 and $3F) shl 6)
     or UInt32(byte4 and $3F)
CP = CP - $10000
highSurrogate = $D800 + UInt16((CP shr 10) and $3FF)
lowSurrogate = $DC00 + UInt16(CP and $3FF)
UTF16 = highSurrogate, lowSurrogate
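
The 4-byte case can be sketched the same way in Python. The function name `utf8_4byte_to_surrogates` is hypothetical, and the code assumes a valid 4-byte sequence whose codepoint lies in U+10000..U+10FFFF; it performs no validation of its own.

```python
from typing import Tuple


def utf8_4byte_to_surrogates(seq: bytes) -> Tuple[int, int]:
    """Convert a valid 4-byte UTF-8 sequence to a (high, low) surrogate pair."""
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx: 3 + 6 + 6 + 6 payload bits
    cp = (
        ((seq[0] & 0x07) << 18)
        | ((seq[1] & 0x3F) << 12)
        | ((seq[2] & 0x3F) << 6)
        | (seq[3] & 0x3F)
    )
    cp -= 0x10000  # leaves a 20-bit value to split across the two surrogates
    high = 0xD800 + ((cp >> 10) & 0x3FF)  # top 10 bits
    low = 0xDC00 + (cp & 0x3FF)           # bottom 10 bits
    return high, low
```

For example, F0 9F 98 80 (U+1F600) yields the pair $D83D, $DE00.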

Now, with that said, let's look at your example: E2 82 AC

The first byte satisfies ($E2 and $F0) = $E0, the second byte satisfies ($82 and $C0) = $80, and the third byte satisfies ($AC and $C0) = $80, so this is indeed a valid UTF-8 3-byte sequence.

Plugging those byte values into the 3-byte formula, you get:

UTF16 = (UInt16($E2 and $0F) shl 12)
        or (UInt16($82 and $3F) shl 6)
        or UInt16($AC and $3F)

      = (UInt16($02) shl 12)
        or (UInt16($02) shl 6)
        or UInt16($2C)

      = $2000
        or $80
        or $2C

      = $20AC

And indeed, Unicode codepoint U+20AC is encoded in UTF-16 as $20AC.
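
If you want to cross-check a hand calculation like this one, one option is to let Python's built-in codecs do the conversion and compare the resulting code units. The helper name `utf16_units_via_codec` is invented for this sketch; `bytes.decode`, `str.encode`, and `int.from_bytes` are standard Python.

```python
def utf16_units_via_codec(seq: bytes) -> list:
    """Decode UTF-8 with the built-in codec and return the UTF-16 code units."""
    # "utf-16-be" emits big-endian code units without a byte order mark.
    encoded = seq.decode("utf-8").encode("utf-16-be")
    return [int.from_bytes(encoded[i:i + 2], "big") for i in range(0, len(encoded), 2)]


units = utf16_units_via_codec(bytes([0xE2, 0x82, 0xAC]))
print([f"${u:04X}" for u in units])
```

Unlike the bit formulas, the built-in decoder also rejects malformed input by raising `UnicodeDecodeError`.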
