Parsing JSON using Pandas - issue with additional \ escape characters

I'm downloading files from S3 that contain JSON-like data, which I intend to parse into a Pandas dataframe using pd.read_json.

My problem is that the files dumped into the S3 bucket use an 'octal escape' format for non-English characters, but Python/Pandas objects to the fact that the \ character itself is also escaped.

An example would be the string: "destination":"Provence-Alpes-C\\303\\264te d\'Azur"

Which prints as:

[image: printed output]

If I manually remove one of the \ characters then Python happily interprets the string and it prints as:

[image: printed output]

There is some good stuff in this thread, and although .decode('string_escape') works well on an individual snippet, it doesn't work when the snippet is part of a much longer string comprising thousands of records.

I believe that I need a clever way to replace the \\ with \, but for well-documented reasons .replace('\\', '\') doesn't work.

To get the files to work at all I used a regex to remove every \ followed by a digit: re.sub(r'\\(?=[0-9])', '', g). I'm thinking that an adaptation of this might be the way forward, but the number needs to be dynamic as I don't know what it will be (i.e. using \3 and \2 for the example above isn't going to work).

Help appreciated.

Rather than have Python interpret the \ooo octal escapes, repair the JSON with a regular expression, then parse it as JSON. I did so before in similar circumstances.

Your data has UTF-8 bytes escaped to octal \ooo sequences, so the pattern only has to match a limited range of values here:

import re

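# Match a backslash followed by an octal value from 1 to 377 (0x01 to 0xFF).
invalid_escape = re.compile(r'\\([1-3][0-7]{2}|[1-7][0-7]?)')

def replace_with_codepoint(match):
    # Python 2: chr() turns the octal value into the single byte it encodes.
    return chr(int(match.group(1), 8))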

def repair(brokenjson):
    return invalid_escape.sub(replace_with_codepoint, brokenjson)
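
To see why the pattern only needs to cover \1 through \377, it may help to look at where the escapes come from. The snippet below is a small illustration written for Python 3; the variable names are made up for this example. It shows that the two escapes in the sample are simply the two UTF-8 bytes of "ô" written in octal.

```python
# Illustration only: variable names are not from the original answer.
encoded = "\u00f4".encode("utf-8")  # "ô" encoded as UTF-8 is two bytes

# Write each byte as a backslash followed by its octal value.
octal_escapes = "".join("\\{:o}".format(byte) for byte in encoded)

print(encoded)        # b'\xc3\xb4'
print(octal_escapes)  # \303\264
```

Since a byte can never exceed 0xFF (octal 377), the regular expression does not have to accept anything larger.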

Demo:

>>> import json
>>> sample = '{"destination":"Provence-Alpes-C\\303\\264te d\'Azur"}'
>>> repair(sample)
'{"destination":"Provence-Alpes-C\xc3\xb4te d\'Azur"}'
>>> json.loads(repair(sample))
{u'destination': u"Provence-Alpes-C\xf4te d'Azur"}
>>> print json.loads(repair(sample))['destination']
Provence-Alpes-Côte d'Azur
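
The demo above runs on Python 2, where chr() returns a byte string. On Python 3, chr() returns a Unicode character instead, so the same code would produce "Ã´" rather than "ô". One possible adaptation, sketched here under the assumption that the file contents are read as bytes, is to do the repair on bytes and decode as UTF-8 afterwards. The names OCTAL_ESCAPE and repair_bytes are hypothetical and not part of the original answer.

```python
import json
import re

# Bytes pattern: a backslash followed by an octal value from 1 to 377.
OCTAL_ESCAPE = re.compile(rb'\\([1-3][0-7]{2}|[1-7][0-7]?)')

def repair_bytes(broken_json):
    """Replace each \\ooo octal escape with the single byte it encodes."""
    return OCTAL_ESCAPE.sub(
        lambda m: bytes([int(m.group(1), 8)]),  # octal digits -> one raw byte
        broken_json,
    )

# Same sample as in the demo above, as a bytes literal.
sample = b'{"destination":"Provence-Alpes-C\\303\\264te d\'Azur"}'

repaired = repair_bytes(sample)
record = json.loads(repaired.decode('utf-8'))
print(record['destination'])  # Provence-Alpes-Côte d'Azur
```

The repaired bytes decode to ordinary JSON text, which can then be handed to whichever parser is needed. Like the original pattern, this sketch does not distinguish an octal escape from a legitimately escaped backslash that happens to be followed by digits.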
