Detect/remove unpaired surrogate character in Python 2 + GTK
In Python 2.7 I can successfully convert the Unicode string "abc\udc34xyz" to UTF-8 (result is "abc\xed\xb0\xb4xyz"). But when I pass the UTF-8 string to e.g. pango_parse_markup() or g_convert_with_fallback(), I get errors like "Invalid byte sequence in conversion input". Apparently the GTK/Pango functions detect the "unpaired surrogate" in the string and (correctly?) reject it.
Python 3 doesn't even allow conversion of the Unicode string to UTF-8 (error: "'utf-8' codec can't encode character '\udc34' in position 3: surrogates not allowed"), but I can run "abc\udc34xyz".encode("utf8", "replace") to get a valid UTF-8 string with the lone surrogate replaced by some other character. That's fine for me, but I need a solution for Python 2.
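For reference, the Python 3 behaviour described above can be reproduced with a short sketch. It assumes Python 3 and uses only the built-in `replace` and `surrogatepass` error handlers. Note that `replace` on encoding substitutes a plain `?` rather than U+FFFD.

```python
# Python 3 sketch: how the three error-handling modes treat a lone surrogate.
s = "abc\udc34xyz"  # lone trail surrogate, as in the question

# Strict mode (the default) refuses to encode the surrogate.
try:
    s.encode("utf8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# "replace" substitutes "?" for the unencodable character.
replaced = s.encode("utf8", "replace")

# "surrogatepass" reproduces the bytes that Python 2.7 emits.
passed = s.encode("utf8", "surrogatepass")

print(strict_ok, replaced, passed)
```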
So the question is: in Python 2.7, how can I convert that Unicode string to UTF-8 while replacing the lone surrogate with some replacement character like U+FFFD? Preferably only standard Python functions and GTK/GLib/G... functions should be used.
Btw. iconv can convert the string to UTF-8, but it simply removes the bad character instead of replacing it with U+FFFD.
You can do the replacements yourself before encoding:
import re
lone = re.compile(
    ur'''(?x)                 # verbose expression (allows comments)
    (                         # begin group
      [\ud800-\udbff]         # match leading surrogate
      (?![\udc00-\udfff])     # but only if not followed by trailing surrogate
    )                         # end group
    |                         # OR
    (                         # begin group
      (?<![\ud800-\udbff])    # if not preceded by leading surrogate
      [\udc00-\udfff]         # match trailing surrogate
    )                         # end group
    ''')
u = u'abc\ud834\ud82a\udfcdxyz'
print repr(u)
b = lone.sub(ur'\ufffd',u).encode('utf8')
print repr(b)
print repr(b.decode('utf8'))
Output:
u'abc\ud834\U0001abcdxyz'
'abc\xef\xbf\xbd\xf0\x9a\xaf\x8dxyz'
u'abc\ufffd\U0001abcdxyz'
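The same idea can be wrapped in a small reusable helper. The following is only a sketch: the names `replace_lone_surrogates` and `to_utf8` are made up for illustration, and the pattern is a compact form of the regular expression above. It avoids `ur''` literals and the `print` statement so that it runs on Python 3 and should also run on Python 2.7. It has only been reasoned through for lone surrogates; an adjacent lead/trail pair is left untouched, as in the original pattern.

```python
import re

# Compact form of the pattern above: a lead surrogate not followed by a
# trail surrogate, or a trail surrogate not preceded by a lead surrogate.
_LONE_SURROGATE = re.compile(
    u'[\ud800-\udbff](?![\udc00-\udfff])'
    u'|(?<![\ud800-\udbff])[\udc00-\udfff]'
)

def replace_lone_surrogates(text, replacement=u'\ufffd'):
    """Return text with every lone surrogate replaced (hypothetical helper)."""
    return _LONE_SURROGATE.sub(replacement, text)

def to_utf8(text):
    """Encode to UTF-8 after scrubbing lone surrogates (hypothetical helper)."""
    return replace_lone_surrogates(text).encode('utf8')

print(repr(to_utf8(u'abc\udc34xyz')))
```

The resulting byte string contains U+FFFD (`\xef\xbf\xbd`) in place of the surrogate, which is the kind of input that strict UTF-8 validators are expected to accept.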
Here's what fixed this issue for me:
invalid_string.encode('utf16').decode('utf16', 'replace')
My understanding is that surrogate pairs are part of UTF-16, and that's why encoding/decoding with UTF-8 doesn't do anything.
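On Python 3 the one-liner above does not work unchanged, because the UTF-16 encoder also rejects lone surrogates in strict mode. A rough Python 3 equivalent, assuming Python 3.4 or later (where `surrogatepass` is supported by the UTF-16 codecs), might look like the sketch below. The name `scrub_surrogates` is made up for illustration.

```python
def scrub_surrogates(text):
    """Round-trip through UTF-16 so lone surrogates become U+FFFD.

    Hypothetical helper for Python 3.4+.
    """
    # "surrogatepass" lets the encoder emit lone surrogates as raw
    # UTF-16 code units instead of raising UnicodeEncodeError.
    raw = text.encode("utf-16", "surrogatepass")
    # The decoder treats an unpaired surrogate as an error, and the
    # "replace" handler substitutes U+FFFD for it.
    return raw.decode("utf-16", "replace")

cleaned = scrub_surrogates("abc\udc34xyz")
print(ascii(cleaned), cleaned.encode("utf8"))
```

After this step the string can be encoded to UTF-8 in strict mode.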