
UnicodeDecodeError in json.dump

I have a complex JSON-serializable data structure that contains, somewhere inside it, both unicode strings and UTF-8 byte strings.

When I try to serialize the structure with ensure_ascii=False, it fails:

Python 2.7.5+ (default, Sep 19 2013, 13:48:49) 
[GCC 4.8.1] on linux2
>>> import json
>>> json.dumps(['\xd0\xb2', u'\xd0\xb2'], ensure_ascii=False)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/json/__init__.py", line 250, in dumps
    sort_keys=sort_keys, **kw).encode(obj)
  File "/usr/lib/python2.7/json/encoder.py", line 210, in encode
    return ''.join(chunks)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd0 in position 1: ordinal not in range(128)
>>> 

I understand why this happens, but is there an easier or built-in way to make it work, short of recursively walking the data structure, finding the byte strings and decoding them to unicode?

AFAIK, the reason to serialize to JSON is to store or to transfer some information. If you specify ensure_ascii=False, non-ASCII characters are left unescaped, which makes little sense, because your goal is precisely to encode and serialize your data.

Basically you are asking for a single string that mixes encoded byte strings with unescaped unicode. In Python 2, joining the two forces an implicit ASCII decode of the byte chunks, and that is exactly the UnicodeDecodeError in your traceback.
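Python 3 turns the same mix into a hard error instead of an implicit decode: bytes objects are simply not JSON serializable. A minimal sketch (Python 3 syntax, same data as in the question):

```python
import json

# Mixing a UTF-8 byte string with a text string, as in the question.
# Python 3 rejects the bytes outright rather than attempting an
# implicit ASCII decode the way Python 2 does.
try:
    json.dumps([b'\xd0\xb2', '\u0432'], ensure_ascii=False)
except TypeError as exc:
    print('TypeError:', exc)

# Once everything is text, ensure_ascii=False works as intended and
# keeps the non-ASCII character unescaped in the output.
print(json.dumps(['\u0432', '\u0432'], ensure_ascii=False))
```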

From the official docs:

If ensure_ascii is True (the default), all non-ASCII characters in the output are escaped with \uXXXX sequences, and the result is a str instance consisting of ASCII characters only. If ensure_ascii is False, some chunks written to fp may be unicode instances. This usually happens because the input contains unicode strings or the encoding parameter is used. Unless fp.write() explicitly understands unicode (as in codecs.getwriter()) this is likely to cause an error.

On the other hand, the fact that you are designing an API does not mean that you have no control over the input. An API is, in a sense, a contract: given some input, some output is returned. So you can, and should, always specify what input you expect.

In your case, you can check the elements one by one and convert the byte strings to unicode. That being said, my proposal is that you require your users to pass unicode and that you don't specify ensure_ascii=False.
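The element-by-element conversion could look like the following sketch (Python 3 syntax; decode_bytes is a hypothetical helper name, and UTF-8 is assumed for all byte strings):

```python
import json

def decode_bytes(obj, encoding='utf-8'):
    """Recursively replace byte strings with text strings."""
    if isinstance(obj, bytes):
        return obj.decode(encoding)
    if isinstance(obj, dict):
        return {decode_bytes(k, encoding): decode_bytes(v, encoding)
                for k, v in obj.items()}
    if isinstance(obj, list):
        return [decode_bytes(item, encoding) for item in obj]
    return obj

# Mixed structure like the one in the question, with nesting.
mixed = [b'\xd0\xb2', '\u0432', {b'key': [b'\xd0\xb2']}]
print(json.dumps(decode_bytes(mixed), ensure_ascii=False))
```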

For me the general rules to understand encoding and avoid problems are these:

  1. Strings within your code MUST be unicode.
  2. When importing data, DECODE it so that it becomes unicode. When exporting, ENCODE. This requires that both parties agree on the encoding they are using; otherwise you just get noise.
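The two rules above in concrete form (Python 3 syntax, assuming UTF-8 at both ends of the pipeline):

```python
# Bytes arriving from the outside world (a socket, a file, a request body).
raw = b'caf\xc3\xa9'

# Rule 2, import side: DECODE at the boundary so all in-program
# strings are text (rule 1).
text = raw.decode('utf-8')
assert text == 'caf\xe9'

# ...work with text inside the program...
result = text.upper()

# Rule 2, export side: ENCODE when handing the data back out.
out = result.encode('utf-8')
print(out)
```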
