Preserve Unicode escape sequences with json.loads(), or convert characters back to escapes when calling json.dumps()

I have a JSON file that contains the Unicode escape sequences \u003c and \u003e. When loading the file with json.load(), these are converted to < and >. Consider the following experiment:

import json
d = json.loads('"Foo \u003cfoo@bar.net\u003e"')

Which then prints like:

'Foo <foo@bar.net>'

Say that I need to dump this back to a file and need the characters < and > converted back to \u003c and \u003e. I am currently using f.write(json.dumps(d)), but that does not work.

I have searched for hours but am just not able to figure this out.

Well, here it would be useful to understand what the Python interpreter is doing.

When the interpreter finds the beginning of a string literal

In your source code, you have this piece of text:

'"Foo \u003cfoo@bar.net\u003e"'

When the parser finds the first character, ' , it concludes: "This is a string literal! Until I find the next ' , I should collect all the characters and put them in a list, to use as a string." So, let us say it creates the following list in memory:

[]

Then it finds the next character, " . Since the string literal is not closed yet (because no ' was found), it adds the character to the list. Like everything inside computers, characters are represented as numbers. The number is the character's Unicode code point, and for " the code point is 34:

[ 34 ]
#  "

It does the same to the next characters, putting their code points in the list:

[ 34   70  111  111   32 ]
#  "    F    o    o       
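The list above can be reproduced directly. This is only a small sketch using the built-in ord(), and the variable names are made up for illustration:

```python
# The first five characters of the literal: a double quote, F, o, o and a space.
prefix = '"Foo '

# ord() returns the Unicode code point of a single character.
code_points = [ord(ch) for ch in prefix]

print(code_points)  # [34, 70, 111, 111, 32]
```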

The \ and u characters from your source code

Now the interpreter finds the character \ . This is not an ordinary character: to the interpreter, it means that the following characters do not stand for themselves but must be interpreted. So the interpreter does not add \ to the list, and reads the next character to work out what to do. This is why there is no \ in your result.

The next character is u . Since it was preceded by \ , the interpreter does not insert it into the list either. Instead, the \u pair is interpreted as a command to take the next four characters and read them as a hexadecimal number. That is why there is no \u in your result.

How six characters become only one

The next four chars are 0 , 0 , 3 and c . They form the 0x3C hex number, that is 60 in decimal form. So it is added to the list:

[ 34   70  111  111   32   60 ]
#  "    F    o    o         <

Well, 60 is the code point of < in Unicode. That's why there is a < in your result. This is why the six characters ( \ , u , 0 , 0 , 3 , c ) actually represent only one ( < ) when the program runs.
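The same arithmetic can be checked by hand. The sketch below uses only the built-ins int() and chr(), and the variable names are illustrative:

```python
# The four characters that follow \u in the source code.
hex_digits = "003c"

# Read them as a base-16 number, then look up the character with that code point.
code_point = int(hex_digits, 16)
character = chr(code_point)

print(code_point, character)  # 60 <

# In a normal (non-raw) literal, the six typed characters collapse into one.
escaped = "\u003c"
print(len(escaped), escaped == "<")  # 1 True
```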

How to get what you want

Of course, you may want to have the characters \ , u etc. in your result string. If so, Python gives you some options, and the simplest one is the raw string literal. To use it, you just need to prefix your string literal with r , as below:

r'"Foo \u003cfoo@bar.net\u003e"'

When the interpreter finds the r in the source code followed by a quote (such as ' ), it knows it is a string literal, but one in which \ is not interpreted at all. Everything inside it is used as it was typed in the source code. This gives a result similar to the one you seem to want:

>>> print('"Foo \u003cfoo@bar.net\u003e"')
"Foo <foo@bar.net>"
>>> print(r'"Foo \u003cfoo@bar.net\u003e"')
"Foo \u003cfoo@bar.net\u003e"

Be Careful What You Wish For

Note however that these strings are completely different! Even their sizes are very different, because the second one has more characters:

>>> len('"Foo \u003cfoo@bar.net\u003e"')
19
>>> len(r'"Foo \u003cfoo@bar.net\u003e"')
29
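One thing worth checking before reaching for a raw string: the JSON parser decodes \u escapes on its own, so passing the raw text to json.loads() should give the same value as the non-raw literal. A small sketch, with illustrative variable names:

```python
import json

# Non-raw literal: Python itself turns \u003c into < before json.loads() sees it.
from_plain = json.loads('"Foo \u003cfoo@bar.net\u003e"')

# Raw literal: json.loads() receives the backslashes and decodes them itself.
from_raw = json.loads(r'"Foo \u003cfoo@bar.net\u003e"')

print(from_plain)              # Foo <foo@bar.net>
print(from_raw)                # Foo <foo@bar.net>
print(from_plain == from_raw)  # True
```

Either way, the loaded value contains the real < and > characters, which is what happens when the file is read with json.load().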

Now, I have to say, you likely do not want a raw string here. You may only want to represent the string with Unicode escapes, but that raises the question of why. Anyway, it is up to you now to decide what you want :)
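If the output file really must contain \u003c and \u003e, one possible approach is to post-process what json.dumps() returns. This is only a sketch and not part of the json module: the helper name dumps_escaping_angle_brackets is made up. It relies on the fact that < and > never appear in JSON syntax outside of strings, so replacing them in the serialized text only touches string contents:

```python
import json

def dumps_escaping_angle_brackets(obj):
    """Hypothetical helper: serialize obj, then write < and > as \\u escapes."""
    text = json.dumps(obj)
    # json.dumps() leaves < and > as they are, so escape them afterwards.
    return text.replace("<", "\\u003c").replace(">", "\\u003e")

d = json.loads(r'"Foo \u003cfoo@bar.net\u003e"')
encoded = dumps_escaping_angle_brackets(d)

print(encoded)                   # "Foo \u003cfoo@bar.net\u003e"
print(json.loads(encoded) == d)  # True
```

Writing the result with f.write(encoded) would then put the escaped form in the file, and any JSON parser reading it back should still see < and > .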
