
Decoding double-encoded UTF-8 in Python

I've got a problem with strings that I get from one of my clients over xmlrpc. He sends me utf8 strings that are encoded twice :( so when I get them in Python I have a unicode object that has to be decoded one more time, but obviously Python doesn't allow that. I've notified my client, but I need a quick workaround for now, until he fixes it.

Raw string from tcp dump:

<string>Rafa\xc3\x85\xc2\x82</string>

This is converted into:

u'Rafa\xc5\x82'

The best we've come up with is:

eval(repr(u'Rafa\xc5\x82')[1:]).decode("utf8") 

This results in the correct string, which is:

u'Rafa\u0142' 

This works, but it is ugly as hell and cannot be used in production code. If anyone knows how to fix this problem in a more suitable way, please write. Thanks, Chris
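For reference, the bytes in the dump are consistent with UTF-8 data being misread as Latin-1 and then encoded as UTF-8 a second time. The Python 3 sketch below reproduces that. It is only an assumption about what the client does (the thread does not say), and the variable names are made up for illustration.

```python
# Assumption: the client treats already-encoded UTF-8 bytes as Latin-1 text
# and encodes them again. This is a guess that happens to match the dump.
name = "Rafa\u0142"                              # the intended string

once = name.encode("utf-8")                      # correct UTF-8 bytes
twice = once.decode("latin-1").encode("utf-8")   # the accidental second pass

received = twice.decode("utf-8")                 # what the XML-RPC layer hands over

print(twice)
print(ascii(received))
```

Under that assumption, `twice` equals the raw bytes from the tcp dump and `received` equals the mangled unicode object shown above.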

>>> s = u'Rafa\xc5\x82'
>>> s.encode('raw_unicode_escape').decode('utf-8')
u'Rafa\u0142'
>>>

Yow, that was fun!

>>> original = "Rafa\xc3\x85\xc2\x82"
>>> first_decode = original.decode('utf-8')
>>> as_chars = ''.join([chr(ord(x)) for x in first_decode])
>>> result = as_chars.decode('utf-8')
>>> result
u'Rafa\u0142'

So you do the first decode, getting a Unicode string where each character is actually a UTF-8 byte value. You then go via the integer value of each of those characters to get back to a genuine UTF-8 byte string, which you decode as normal.
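The session above is Python 2. A rough Python 3 equivalent of the same steps might look like this, with `bytes(...)` standing in for the `chr`/`join` trick, since Python 3 keeps text and bytes as separate types:

```python
# Python 3 sketch of the same steps; names mirror the Python 2 session.
original = b"Rafa\xc3\x85\xc2\x82"        # raw bytes as seen on the wire

# First decode: each resulting character holds one UTF-8 byte value.
first_decode = original.decode("utf-8")

# Map every character back to its integer value to rebuild the real bytes.
as_bytes = bytes(ord(ch) for ch in first_decode)

# Second decode: now the bytes are genuine UTF-8.
result = as_bytes.decode("utf-8")
print(ascii(result))
```

Note that `bytes(...)` raises `ValueError` if any character has a code point above 255, so this only works on strings that really are double encoded.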

>>> weird = u'Rafa\xc5\x82'
>>> weird.encode('latin1').decode('utf8')
u'Rafa\u0142'
>>>

latin1 is just an abbreviation for Richie's nuts'n'bolts method.
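If this has to live in production code for a while, one option is to wrap the latin1 round trip in a small helper that leaves a string alone when it does not look double encoded. This is a Python 3 sketch; `fix_double_utf8` is a hypothetical name, not something from the thread or the standard library.

```python
def fix_double_utf8(text):
    """Undo one extra layer of UTF-8 encoding, or return text unchanged.

    Hypothetical helper: assumes the extra layer came from bytes being
    read as Latin-1, as in the example above.
    """
    try:
        # latin-1 maps code points 0-255 straight to the same byte values.
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either a character above U+00FF is present, or the bytes are not
        # valid UTF-8: the string was probably fine to begin with.
        return text


print(ascii(fix_double_utf8("Rafa\xc5\x82")))
print(ascii(fix_double_utf8("Rafa\u0142")))
```

The fallback is a heuristic: a correctly encoded string whose characters happen to form valid UTF-8 when viewed as Latin-1 bytes would still be rewritten, so this is a stopgap until the sender is fixed.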

It is very curious that the seriously under-described raw_unicode_escape codec gives the same result as latin1 in this case. Do they always give the same result? If so, why have such a codec? If not, it would be preferable to know for sure exactly how the OP's client did the transformation from 'Rafa\xc5\x82' to u'Rafa\xc5\x82' and then to reverse that process exactly; otherwise we might come unstuck if different data crops up before the double encoding is fixed.
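A quick check suggests the two codecs do not always agree. In the Python 3 sketch below they match for code points up to U+00FF, but for anything higher raw_unicode_escape writes a literal `\uXXXX` escape while latin1 refuses to encode. The sample strings are just the ones from this thread.

```python
# Characters up to U+00FF: both codecs produce the same bytes.
sample = "Rafa\xc5\x82"
same = sample.encode("raw_unicode_escape") == sample.encode("latin-1")

# A character above U+00FF: the codecs diverge.
beyond = "Rafa\u0142"
escaped = beyond.encode("raw_unicode_escape")   # literal backslash-u escape

try:
    beyond.encode("latin-1")
    latin1_failed = False
except UnicodeEncodeError:
    latin1_failed = True

print(same, escaped, latin1_failed)
```

So with latin1, unexpected input fails loudly, whereas raw_unicode_escape would quietly pass escape sequences on to the UTF-8 decode step.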
