
ValueError: Unpaired high surrogate when decoding 'string' on reading json file

I am currently working with Python 3.8.6. I get the following error when reading (thousands of) JSON files in Python:

ValueError: Unpaired high surrogate when decoding 'string' on reading json file

I tried the following solutions from other Stack Overflow posts, but nothing worked:

1) import json
   json.loads('{"":"\\ud800"}')

2) import simplejson
   simplejson.loads('{"":"\\ud800"}')

The problem is that once this error occurs, the remaining JSON files are not read. Is there a way to get rid of this error so I can read all the JSON files?

I am not sure what information is needed about the problem, so please feel free to ask.

Unicode code point U+D800 may only occur as part of a surrogate pair (and then only in UTF-16 encoding). So the string inside the JSON, once decoded, cannot be represented as valid UTF-8.

The JSON itself might or might not be valid. The spec doesn't mention the case of unpaired surrogates, but does explicitly allow nonexistent code points:

To escape a code point that is not in the Basic Multilingual Plane, the character may be represented as a twelve-character sequence, encoding the UTF-16 surrogate pair corresponding to the code point. So for example, a string containing only the G clef character (U+1D11E) may be represented as "\uD834\uDD1E". However, whether a processor of JSON texts interprets such a surrogate pair as a single code point or as an explicit surrogate pair is a semantic decision that is determined by the specific processor.

Note that the JSON grammar permits code points for which Unicode does not currently provide character assignments.

Now, you can choose your friends, but you can't choose your family and you can't always choose your JSON either. So the next question is: how to parse this mess?

It looks like both the built-in json module in Python (version 3.9) and simplejson (version 3.17.2) have no problems parsing the JSON. The problem only occurs once you try to use the string. So this really doesn't have anything to do with JSON at all:

>>> bork = '\ud800'
>>> bork
'\ud800'
>>> print(bork)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud800' in position 0: surrogates not allowed

Fortunately, we can encode the string manually and tell Python how to handle the error. For example, replace the erroneous code point with a question mark:

>>> bork.encode('utf-8', errors='replace')
b'?'

The documentation lists other possible options for the errors argument.
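
As a rough illustration of those other options, the sketch below runs the same string through three of the standard error handlers. The variable names are only for illustration; which handler is appropriate depends on whether you want to drop the bad code point, keep it visible, or preserve it byte for byte.

```python
bork = '\ud800'

# 'ignore' silently drops the lone surrogate.
ignored = bork.encode('utf-8', errors='ignore')            # b''

# 'backslashreplace' keeps a readable escape sequence in its place.
escaped = bork.encode('utf-8', errors='backslashreplace')  # b'\\ud800'

# 'surrogatepass' writes the surrogate's raw three-byte form. The result is
# not valid UTF-8 and only decodes again with errors='surrogatepass'.
passed = bork.encode('utf-8', errors='surrogatepass')      # b'\xed\xa0\x80'
round_trip = passed.decode('utf-8', errors='surrogatepass')
```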

To fix up this broken string, we can encode (into bytes) and then decode (back into str):

>>> bork.encode('utf-8', errors='replace').decode('utf-8')
'?'
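
To apply the encode/decode trick to whole files, something like the following might work. It is a minimal sketch that assumes the files are parsed with the built-in json module (the parser that raised the original error may behave differently); scrub and load_all are hypothetical helper names, not part of any library.

```python
import json

def scrub(value):
    """Recursively replace lone surrogates in every string of a parsed JSON value."""
    if isinstance(value, str):
        return value.encode('utf-8', errors='replace').decode('utf-8')
    if isinstance(value, list):
        return [scrub(item) for item in value]
    if isinstance(value, dict):
        return {scrub(key): scrub(item) for key, item in value.items()}
    return value  # numbers, booleans and None pass through unchanged

def load_all(paths):
    """Parse each file, collecting failures instead of stopping at the first one."""
    loaded, failed = {}, {}
    for path in paths:
        try:
            with open(path, encoding='utf-8') as handle:
                loaded[path] = scrub(json.load(handle))
        except (OSError, ValueError) as exc:
            # json.JSONDecodeError and UnicodeDecodeError are both ValueErrors.
            failed[path] = exc
    return loaded, failed
```

Catching the exception per file means one bad file no longer prevents the rest from being read; the failed dictionary records which files need a closer look.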

A Unicode surrogate in isolation does not correspond to anything. Every valid high surrogate code point needs to be immediately followed by a low surrogate code point before it can be meaningfully decoded.

The error message simply means that this code point in isolation does not have a well-defined meaning. It's like saying "take" without saying what we should take, or "look at" without the object of the sentence filled in.
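
To find out where such isolated code points sit in your data, a small check along these lines may help. It is a sketch that assumes the text has already been parsed by Python's json module, which combines valid pairs into a single character, so any surrogate still present in the resulting str is unpaired. find_surrogates is a hypothetical helper name.

```python
import json
import re

# Any code point in U+D800..U+DFFF left in a parsed string is a surrogate
# that was not combined into a real character.
SURROGATE = re.compile(r'[\ud800-\udfff]')

def find_surrogates(text):
    """Return (index, code point label) for each surrogate found in text."""
    return [(m.start(), f'U+{ord(m.group()):04X}') for m in SURROGATE.finditer(text)]

lonely = find_surrogates(json.loads('"ab\\ud800cd"'))   # [(2, 'U+D800')]
paired = find_surrogates(json.loads('"\\ud834\\udd1e"'))  # []
```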

You should not be using surrogates in files which do not contain UTF-16 anyway; they are reserved strictly for this encoding. Surrogates let UTF-16 encode characters outside the 16-bit range it can represent directly, by splitting each such character across two 16-bit code units.

The simple and obvious fix is to supply the missing information, but we can't know what it is. Perhaps you have more context, and can fill in the correct low surrogate. For example, this works:

>>> json.loads('{"":"\\ud800\\udc00"}')
{'': '𐀀'}

It populates the parsed value with the single code point U+10000, but of course we have no idea whether that's actually the code point your data should contain.
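
For reference, the arithmetic that turns a surrogate pair into one code point can be sketched as follows. combine is a hypothetical helper written here only to show the calculation; the json module performs the equivalent step internally.

```python
def combine(high, low):
    """Combine a UTF-16 high/low surrogate pair into a single code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError('not a high/low surrogate pair')
    # Each surrogate carries 10 bits; the result is offset past the BMP.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

first_supplementary = combine(0xD800, 0xDC00)  # 0x10000
g_clef = combine(0xD834, 0xDD1E)               # 0x1D11E
```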
