
Why does Python json.dumps fail on mixed utf-8 & unicode strings?

The built-in json library in Python 2.x can encode both unicode strings and UTF-8 encoded (non-ASCII) byte strings, but apparently not both at the same time. Try:

import json; json.dumps([u'Ä', u'Ä'.encode("utf-8")], ensure_ascii=False)

and see it raise a UnicodeDecodeError. Whereas both:

json.dumps([u'Ä'], ensure_ascii=False)

and

json.dumps([u'Ä'.encode("utf-8")], ensure_ascii=False)

...work fine.

Why does JSON-encoding data that contains both unicode strings and UTF-8 encoded (non-ASCII) byte strings produce a UnicodeDecodeError? My Python site encoding is ASCII.

It fails because json.dumps doesn't know what kind of output string to produce.

In my Python 2.7:

>>> json.dumps([u'Ä'], ensure_ascii=False)
u'["\xc4"]'

(a Unicode string)

and

>>> json.dumps([u'Ä'.encode("utf-8")], ensure_ascii=False)
'["\xc3\x84"]'

(a UTF-8 encoded byte string)

So if you give it UTF-8 encoded byte strings, it produces the JSON as a UTF-8 encoded byte string, and if you give it Unicode strings, it produces the JSON as a Unicode string.

If you mix them, it can't do both.
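As far as I can tell, the error appears when the pieces of output are joined: Python 2 coerces the byte string to unicode with the default (ASCII) codec, and the non-ASCII bytes can't be decoded. That is my reading of the behaviour, not something confirmed against the json source. A minimal sketch that performs the same decode explicitly (the names `utf8_bytes` and `failed` are just for illustration):

```python
# The UTF-8 encoding of u'Ä' is two bytes, neither of which is valid ASCII.
utf8_bytes = u'\xc4'.encode("utf-8")

try:
    # Explicit version of the decode that Python 2 performs implicitly
    # when a byte string is combined with a unicode string.
    utf8_bytes.decode("ascii")
    failed = False
except UnicodeDecodeError:
    failed = True
```

Here `failed` ends up True, which matches the UnicodeDecodeError from the mixed list.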

To fix this, you can pass an explicit encoding argument (even though it is the same as the default). In my testing, that seems to make the result always a unicode string:

>>> import json; json.dumps([u'Ä', u'Ä'.encode("utf-8")], ensure_ascii=False, encoding="UTF8")
u'["\xc4", "\xc4"]'
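If you'd rather not depend on that side effect of the encoding argument, another option is to decode every byte string yourself before calling json.dumps. This is only a sketch, assuming all byte strings in the data really are UTF-8; `to_text` is a hypothetical helper, not part of the json module:

```python
import json

def to_text(obj, encoding="utf-8"):
    """Recursively decode byte strings so the data holds only unicode text."""
    if isinstance(obj, bytes):  # on Python 2, bytes is the same type as str
        return obj.decode(encoding)
    if isinstance(obj, dict):
        return dict((to_text(k, encoding), to_text(v, encoding))
                    for k, v in obj.items())
    if isinstance(obj, (list, tuple)):
        return [to_text(item, encoding) for item in obj]
    return obj

# One unicode string and one UTF-8 byte string, as in the question.
mixed = [u'\xc4', u'\xc4'.encode("utf-8")]
result = json.dumps(to_text(mixed), ensure_ascii=False)
```

With the input normalised like this, `result` is the unicode string u'["\xc4", "\xc4"]', the same as with the explicit encoding argument.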
