Truncating a UTF-16 string

Question

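I maintain a Python library that validates and prepares input for a downstream Java service. The pre-validation inside the library therefore needs to agree with the downstream service. One pain point is computing the length of certain Unicode strings.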

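Python counts code points to determine the length of a string, while Java counts UTF-16 code units (a character outside the Basic Multilingual Plane is stored as a surrogate pair, i.e. two code units). Usually the two counts are the same, but for characters outside the BMP they differ. For example, the string "wink 😉" has length 6 in Python and length 7 in Java (2 for the emoji + 5 for the other characters).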

So, to replicate Java's length calculation, we need to encode to UTF-16 and divide the byte count by 2:

field_value = "wink 😉"    
len(field_value.encode("utf-16-le")) // 2
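
For reference, the same count can also be obtained without building the encoded bytes, by giving every code point above U+FFFF a weight of 2. This is only a minimal sketch; the helper name utf16_length is hypothetical and not part of the original post.

```python
def utf16_length(text):
    """Count UTF-16 code units, which is what Java's String.length() reports."""
    # Code points above U+FFFF need a surrogate pair (two code units).
    return sum(2 if ord(ch) > 0xFFFF else 1 for ch in text)

field_value = "wink 😉"
print(len(field_value))            # number of code points
print(utf16_length(field_value))   # number of UTF-16 code units
```

For any string that can be encoded as UTF-16, this should agree with the encode-and-divide approach above.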

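However, it gets more challenging if I want to truncate an input string to the maximum allowed length as counted in UTF-16 code units. Converting to UTF-16 and then slicing is overzealous, since not all characters are outside the BMP: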

field_value = "wink 😉"  
field_value.encode("utf-16-le")[:LIMIT].decode("utf-16-le", "ignore")
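
To make the problem concrete, here is a small sketch of two ways a plain byte slice can go wrong. It assumes LIMIT is meant as a number of UTF-16 code units and picks LIMIT = 6 purely for illustration; neither assumption comes from the question itself.

```python
field_value = "wink 😉"
encoded = field_value.encode("utf-16-le")   # 14 bytes = 7 code units
LIMIT = 6  # assumed limit, counted in UTF-16 code units

# Slicing by LIMIT bytes keeps only LIMIT // 2 code units.
too_short = encoded[:LIMIT].decode("utf-16-le")

# Slicing by 2 * LIMIT bytes keeps LIMIT code units, but here the cut lands
# between the two halves of the emoji's surrogate pair, so strict decoding fails.
try:
    encoded[:2 * LIMIT].decode("utf-16-le")
    split_pair_detected = False
except UnicodeDecodeError:
    split_pair_detected = True

print(repr(too_short), split_pair_detected)
```

So the slice either cuts far too early or has to cope with a dangling half of a surrogate pair.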

What is an efficient way in Python to truncate a Unicode string (containing both BMP and non-BMP characters) according to this character weighting?

Answer 1

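Here is a function that truncates at a valid code point in the string. It works by checking that a string that is too long is not cut in the middle of a surrogate pair. It is based on my similar answer for truncating UTF-8. Note that this does not handle graphemes. If you need to test for truncating modifiers, you can use unicodedata.category().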

s = 'A 😉 short 😉😉 test'

def utf16_trailing_surrogate(b):
    '''The high byte of a UTF-16 trailing surrogate starts with the bits 110111xx.'''
    return (b & 0b1111_1100) == 0b1101_1100

def utf16_byte_truncate(text, max_bytes):
    '''Encode text as UTF-16LE and return at most max_bytes bytes of it.
    If the code unit at the cut point is a trailing surrogate, back up two bytes and truncate.
    '''
    i = max_bytes - max_bytes % 2  # make even
    utf16 = text.encode('utf-16le')
    if len(utf16) <= i: # does it fit
        return utf16
    if utf16_trailing_surrogate(utf16[i+1]):
        i -= 2
    return utf16[:i]

# test for various max_bytes:
for m in range(len(s.encode('utf-16le'))+1):
    b = utf16_byte_truncate(s,m)
    print(f'{m:2} {len(b):2} {b.decode("utf-16le")!r}')

Output:

 0  0 ''
 1  0 ''
 2  2 'A'
 3  2 'A'
 4  4 'A '
 5  4 'A '
 6  4 'A '
 7  4 'A '
 8  8 'A 😉'
 9  8 'A 😉'
10 10 'A 😉 '
11 10 'A 😉 '
12 12 'A 😉 s'
13 12 'A 😉 s'
14 14 'A 😉 sh'
15 14 'A 😉 sh'
16 16 'A 😉 sho'
17 16 'A 😉 sho'
18 18 'A 😉 shor'
19 18 'A 😉 shor'
20 20 'A 😉 short'
21 20 'A 😉 short'
22 22 'A 😉 short '
23 22 'A 😉 short '
24 22 'A 😉 short '
25 22 'A 😉 short '
26 26 'A 😉 short 😉'
27 26 'A 😉 short 😉'
28 26 'A 😉 short 😉'
29 26 'A 😉 short 😉'
30 30 'A 😉 short 😉😉'
31 30 'A 😉 short 😉😉'
32 32 'A 😉 short 😉😉 '
33 32 'A 😉 short 😉😉 '
34 34 'A 😉 short 😉😉 t'
35 34 'A 😉 short 😉😉 t'
36 36 'A 😉 short 😉😉 te'
37 36 'A 😉 short 😉😉 te'
38 38 'A 😉 short 😉😉 tes'
39 38 'A 😉 short 😉😉 tes'
40 40 'A 😉 short 😉😉 test'
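
The function above takes a byte limit and returns bytes. If the limit is expressed in UTF-16 code units (as Java counts them) and a str is wanted back, the same idea can be sketched directly on code points. The name truncate_utf16_units is hypothetical, and this is an illustrative variant, not the method from the answer; like the original, it does not handle graphemes.

```python
def truncate_utf16_units(text, max_units):
    """Return the longest prefix of text that fits in max_units UTF-16 code units."""
    used = 0
    for index, ch in enumerate(text):
        width = 2 if ord(ch) > 0xFFFF else 1   # a surrogate pair counts as 2
        if used + width > max_units:
            return text[:index]                # stop before the character that overflows
        used += width
    return text                                # the whole string fits

print(repr(truncate_utf16_units("wink 😉", 6)))
print(repr(truncate_utf16_units("wink 😉", 7)))
```

Because it walks whole code points, the cut can never land inside a surrogate pair, and it stops scanning as soon as the limit is reached.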

Solution 1
1 2021-08-24 17:10:15
