Preserve Unicode escapes when using json.loads(), or convert the characters back when doing json.dumps()

Question

I have a JSON file that contains the Unicode escape sequences \u003c and \u003e. When loading the file with json.load(), these are converted to < and >. Consider the following experiment:

import json
d = json.loads('"Foo \u003cfoo@bar.net\u003e"')

Which then prints like:

'Foo <foo@bar.net>'

Say that I need to dump this back to a file and need the characters < and > converted back to \u003c and \u003e. I am currently using f.write(json.dumps(d)), but that does not seem to work.

I have searched for hours but am just not able to figure this out.

Answer 1

Well, here it would be useful to understand what the Python interpreter is doing.

When the interpreter finds the beginning of a string literal

In your source code, you have this piece of text:

'"Foo \u003cfoo@bar.net\u003e"'

When the parser finds the first character, ', it concludes: "This is a string literal! Until I find the next ', I should take all the characters and put them in a list, to use as a string." So, let us say it creates the following list in memory:

[]

Then it finds the next character, ". Since the string literal is not closed yet (because no ' was found), it adds the character to the list. Like everything inside computers, characters are represented as numbers. The number is the character's Unicode code point, and for " the code point is 34:

[ 34 ]
#  "

It does the same to the next characters, putting their code points in the list:

[ 34   70  111  111   32 ]
#  "    F    o    o
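The same list can be reproduced from Python itself. This is only a small sketch of the idea: the built-in ord() returns the code point of a single character, and the variable names text and code_points are made up for illustration.

```python
# Reproduce the list from the walkthrough, one character at a time.
# ord() returns the Unicode code point of a one-character string.
text = '"Foo '
code_points = [ord(ch) for ch in text]
print(code_points)  # [34, 70, 111, 111, 32]
```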

The `\` and `u` characters from your source code

Now the interpreter finds the character \. But this is not an ordinary character at all! To the interpreter, it means the following characters do not stand for themselves but must be interpreted. So the interpreter does not add \ to the list, and reads the next character to work out what should be done. This is why there is no \ in your result.

The next character is u. Since it was prefixed by \, the interpreter does not insert it into the list. Instead, the \u pair is interpreted as a command to take the next four characters and read them as a hexadecimal number. That's why there is no \u in your result.

How six characters become only one

The next four chars are 0, 0, 3 and c. They form the hex number 0x3C, which is 60 in decimal. So it is added to the list:

[ 34   70  111  111   32   60 ]
#  "    F    o    o         <

Well, 60 is the code point of < in Unicode. That's why there is a < in your result. This is why the six characters (\, u, 0, 0, 3, c) actually represent only one (<) when the program runs.
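The arithmetic can be checked directly. A minimal sketch, using only the built-ins int() and chr(); the names hex_digits and code_point are illustrative:

```python
# The four characters after \u, read as a base-16 number.
hex_digits = "003c"
code_point = int(hex_digits, 16)
print(code_point)        # 60
print(chr(code_point))   # <

# In a normal (non-raw) literal, the six source characters collapse into one.
print("\u003c" == "<")   # True
print(len("\u003c"))     # 1
```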

How to get what you want

Of course, you may want to have the characters \, u, etc. in your result string. If so, Python gives you some options, and the simplest one is the raw string literal. To use it, you just need to prefix your string literal with r, as below:

r'"Foo \u003cfoo@bar.net\u003e"'

When the interpreter finds the r in the source code, followed by a quote (such as '), it knows it is a string literal, but one in which \ is not interpreted at all. Everything inside it is used exactly as it was typed in the source code. This brings a result similar to the one you seem to want:

>>> print('"Foo \u003cfoo@bar.net\u003e"')
"Foo <foo@bar.net>"
>>> print(r'"Foo \u003cfoo@bar.net\u003e"')
"Foo \u003cfoo@bar.net\u003e"

Be Careful What You Wish For

Note, however, that these strings are completely different! Even their lengths differ, because the second one has more characters:

>>> len('"Foo \u003cfoo@bar.net\u003e"')
19
>>> len(r'"Foo \u003cfoo@bar.net\u003e"')
29
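It may also help to see what json.loads() makes of each string. The raw literal is the closer match for what sits in the JSON file, since the backslashes are really there, yet the JSON parser decodes \u escapes on its own, so both inputs end up as the same Python string. A small sketch; the variable names are made up for illustration:

```python
import json

interpreted = '"Foo \u003cfoo@bar.net\u003e"'  # Python resolves the escapes first
raw = r'"Foo \u003cfoo@bar.net\u003e"'         # backslashes survive, as in the file

from_interpreted = json.loads(interpreted)
from_raw = json.loads(raw)  # here the JSON parser decodes \u003c and \u003e

print(from_raw)                      # Foo <foo@bar.net>
print(from_interpreted == from_raw)  # True
```

So by the time the data is a Python string, nothing records whether < was originally written as an escape.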

Now, I have to say, you likely do not want a raw string here. You may only want to represent the string with the Unicode escapes, but that also raises the question of why. Anyway, it is up to you now to decide what you want :)
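If what you want is the escaped form in the dumped file, one possible approach is to post-process the output of json.dumps(). This is only a sketch, not something the json module offers as an option: the helper name dumps_escaping_angle_brackets is hypothetical, and it relies on the assumption that < and > can only occur inside string values or keys in the serialized text, because JSON syntax itself never uses those characters.

```python
import json

def dumps_escaping_angle_brackets(obj):
    # Hypothetical helper: serialize obj, then rewrite every < and > as the
    # equivalent JSON escape. "\\u003c" is a backslash followed by u003c.
    text = json.dumps(obj)
    return text.replace("<", "\\u003c").replace(">", "\\u003e")

d = json.loads(r'"Foo \u003cfoo@bar.net\u003e"')
encoded = dumps_escaping_angle_brackets(d)

print(encoded)                   # "Foo \u003cfoo@bar.net\u003e"
print(json.loads(encoded) == d)  # True
```

The result is still valid JSON and loads back to the same value, so f.write(encoded) would round-trip. Note that this escapes every < and >, including any that were literal characters in the original file.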
