Decode UTF-8 encoding in JSON string

I have a JSON file that contains strings encoded like this:

"sender_name": "Horn\u00c3\u00adkov\u00c3\u00a1",

I am trying to parse this file using the json module. However, I am not able to decode this string correctly.

After decoding the JSON with the .load() method I get 'HornÃ\xadková'. The string should decode to 'Horníková' instead.

I read the JSON specification and I understand that \u should be followed by four hexadecimal digits giving the Unicode code point of the character. But it seems that in this JSON file the UTF-8 encoded bytes are stored as \u sequences.

What type of encoding is this, and how do I parse it correctly in Python 3?

Is this type of JSON file even valid JSON according to the specification?

Your text is already encoded. Normally you would tell Python this with a b prefix on the string literal, but since you are using json and its input needs to be a string, you have to decode the encoded text manually. Because your input is not bytes, you can use the 'raw_unicode_escape' encoding to convert the string to bytes without re-encoding it, and pass the same encoding to open so that it does not apply its own default encoding. You can then decode those bytes to get the desired result.

Note that since you have to read the file content and perform the encoding and decoding on the loaded string, you should use json.loads() instead of json.load().

In [168]: with open('test.json', encoding='raw_unicode_escape') as f:
     ...:     d = json.loads(f.read().encode('raw_unicode_escape').decode())
     ...:     

In [169]: d
Out[169]: {'sender_name': 'Horníková'}
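The session above depends on an existing test.json. As a self-contained sketch of the same idea, the snippet below writes the question's text to a temporary file first (the temporary file and the names raw_text, text and as_bytes are only for illustration) and splits the one-liner into its individual steps:

```python
import json
import os
import tempfile

# The escaped text as it appears in the question's file (ASCII only).
raw_text = r'{"sender_name": "Horn\u00c3\u00adkov\u00c3\u00a1"}'

# Write it to a throwaway file so the example does not depend on test.json.
with tempfile.NamedTemporaryFile('w', suffix='.json', delete=False,
                                 encoding='ascii') as tmp:
    tmp.write(raw_text)
    path = tmp.name

try:
    with open(path, encoding='raw_unicode_escape') as f:
        text = f.read()                           # \u00c3 becomes the character U+00C3
    as_bytes = text.encode('raw_unicode_escape')  # U+00C3 becomes the single byte 0xC3
    d = json.loads(as_bytes.decode('utf-8'))      # bytes 0xC3 0xAD become 'í'
finally:
    os.remove(path)

print(d)
```

One caveat with this approach: the escapes are expanded before the JSON is parsed, so an escaped quote such as \u0022 inside a string would turn into a literal quote and likely break parsing. Fixing the strings after json.loads() avoids that.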

The JSON you are reading was written incorrectly. The Unicode strings decoded from it have to be re-encoded with the wrong encoding that was used, then decoded with the correct encoding.

Here's an example:

#!python3
import json

# The bad JSON you have
bad_json = r'{"sender_name": "Horn\u00c3\u00adkov\u00c3\u00a1"}'
print('bad_json =',bad_json)

# The wanted result from json.loads()
wanted = {'sender_name':'Horníková'}

# What correctly written JSON should look like
good_json = json.dumps(wanted)
print('good_json =',good_json)

# What you get when loading the bad JSON.
got = json.loads(bad_json)
print('wanted =',wanted)
print('got =',got)

# How to correct the mojibake string
corrected_sender = got['sender_name'].encode('latin1').decode('utf8')
print('corrected_sender =',corrected_sender)

Output:

bad_json = {"sender_name": "Horn\u00c3\u00adkov\u00c3\u00a1"}
good_json = {"sender_name": "Horn\u00edkov\u00e1"}
wanted = {'sender_name': 'Horníková'}
got = {'sender_name': 'HornÃ\xadková'}
corrected_sender = Horníková
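The example corrects a single field. If every string in the file was written the same way, the same latin1/utf8 round trip can be applied to the whole parsed structure. The helper below is a minimal sketch under that assumption; fix_mojibake and the extra "tags" field are hypothetical names used only for illustration:

```python
import json

def fix_mojibake(value):
    """Recursively repair strings that hold UTF-8 bytes misread as Latin-1."""
    if isinstance(value, str):
        return value.encode('latin-1').decode('utf-8')
    if isinstance(value, list):
        return [fix_mojibake(item) for item in value]
    if isinstance(value, dict):
        # Keys are strings too, so they get the same treatment.
        return {fix_mojibake(k): fix_mojibake(v) for k, v in value.items()}
    return value  # numbers, booleans and None pass through unchanged

bad_json = r'{"sender_name": "Horn\u00c3\u00adkov\u00c3\u00a1", "tags": ["\u00c3\u00a1", 1]}'
fixed = fix_mojibake(json.loads(bad_json))
print(fixed)
```

This raises an exception if any string in the data is not mojibake of this kind, which may or may not be what you want.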

I don't know enough about JSON to say whether this is valid, but you can parse these strings using the raw_unicode_escape codec:

>>> "Horn\u00c3\u00adkov\u00c3\u00a1".encode('raw_unicode_escape').decode('utf8')
'Horníková'

Re-encode to bytes, and then re-decode to text.

>>> 'HornÃ\xadková'.encode('latin-1').decode('utf-8')
'Horníková'

Is this type of JSON file even valid JSON according to the specification?

No.

A string is a sequence of zero or more Unicode characters, wrapped in double quotes, using backslash escapes [emphasis added].

source

A string is a sequence of Unicode code points wrapped with quotation marks (U+0022). [...] Any code point may be represented as a hexadecimal escape sequence [...] represented as a six-character sequence: a reverse solidus, followed by the lowercase letter u, followed by four hexadecimal digits that encode the code point [emphasis added].

source

UTF-8 byte sequences are neither Unicode characters nor Unicode code points.
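The distinction can be seen directly in Python. The short sketch below (variable names are illustrative) compares the code point of 'í' with its UTF-8 bytes, and shows what happens when each byte is written as if it were a code point, which is what the file in the question appears to contain:

```python
import json

ch = 'í'
code_point = ord(ch)        # a single code point, U+00ED
utf8 = ch.encode('utf-8')   # two bytes, 0xC3 0xAD

print(hex(code_point))
print(utf8)
print(json.dumps(ch))       # a correct escape encodes the code point

# Escaping each UTF-8 byte as its own \uXXXX sequence produces the bad form.
bad_escape = '"' + ''.join('\\u{:04x}'.format(b) for b in utf8) + '"'
print(bad_escape)
print(repr(json.loads(bad_escape)))  # two characters, U+00C3 and U+00AD
```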
