
How to parse "{Key=Value}" structs into JSON with Python?

If I run an Athena query in AWS, the data I get back has structs with key/value pairs that look like this:

{
    "events": "[{deviceType=Android,logins=400},{deviceType=iPhone,logins=550}]"
}

I can use regular expressions to parse this, but things like special characters make that de-serialization very error-prone.

For example, {deviceType=Android, date=2022-01-01} will run into issues with delimiters if I use regex.

Is there an existing de-serializer for this type of thing?

EDIT:

This is the de-serialize regex I have:

import json
import re

def deserialize(s):
    # Surround any word with "
    s1 = re.sub(r'(\w+)', r'"\g<1>"', s)

    # Replace = with :
    s2 = re.sub('=', ':', s1)

    return json.loads(s2)

This hits issues when there are special characters in the value, like "-" or ".". The \w+ pattern does not match those characters, so a value such as 2022-01-01 is split into several "words" and the enclosing quotes end up in the wrong places.
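To make the failure concrete, here is a small sketch that applies the same two substitutions to the date example. The helper name naive_deserialize is made up for illustration, and it returns the intermediate string instead of parsing it so the misplaced quotes are visible.

```python
import json
import re

def naive_deserialize(s):
    # Same two steps as the function above, minus the final json.loads
    quoted = re.sub(r'(\w+)', r'"\g<1>"', s)
    return quoted.replace('=', ':')

broken = naive_deserialize('{deviceType=Android, date=2022-01-01}')
print(broken)  # {"deviceType":"Android", "date":"2022"-"01"-"01"}

try:
    json.loads(broken)
    parse_failed = False
except json.JSONDecodeError:
    # The date was quoted as three separate "words", which is not valid JSON
    parse_failed = True

print(parse_failed)  # True
```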

The data inside the quotes is almost JSON but it's missing the quotes around keys and values. With a few judiciously chained .replace() method calls, you should be able to convert it from almost-JSON to JSON and then deserialize it using the json module:

import json
obj = {"events": "[{deviceType=Android, date=2022-01-01}]"}
events = obj['events']
events_json = events.replace(', ', ',').replace('{', '{"').replace('}', '"}').replace('=', '":"').replace(',', '","').replace('}","{','},{')
parsed = json.loads(events_json)
print(parsed[0])

print(parsed[0]['deviceType']) # prints 'Android'
print(parsed[0]['date']) # prints '2022-01-01'

*Edit to fix an issue raised by MisterMiyagi.
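As a rough sketch, the same chain can be wrapped in a function and run against the two-record sample from the question. The function name almost_json_to_records is hypothetical, and the approach still assumes that keys and values never contain "{", "}", "=" or "," themselves.

```python
import json

def almost_json_to_records(text):
    """Hypothetical helper: apply the replace chain, then parse the result."""
    as_json = (
        text.replace(', ', ',')          # normalise "a, b" to "a,b"
        .replace('{', '{"')              # open quote before the first key
        .replace('}', '"}')              # close quote after the last value
        .replace('=', '":"')             # key/value separator
        .replace(',', '","')             # pair separator (also hits "},{")
        .replace('}","{', '},{')         # undo the damage between records
    )
    return json.loads(as_json)

records = almost_json_to_records(
    "[{deviceType=Android,logins=400},{deviceType=iPhone,logins=550}]"
)
print(records)

# Every value comes back as a string, so numbers need an explicit conversion
logins = [int(r['logins']) for r in records]
print(logins)  # [400, 550]
```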

Instead of parsing this not-quite-JSON I recommend casting maps and arrays to JSON in your queries:

SELECT CAST(events AS JSON) AS events …

This has the added benefit of making the output less ambiguous to parse. Without casting to JSON, there is no way to know whether "[1, 2, 3]" was an array of integers or of strings, or whether "[hello, world]" was an array of two elements or a single element containing a comma.
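On the Python side, a column that has been cast this way should only need json.loads. The sketch below assumes the column arrives as a JSON text string; the row dict and its contents are made up for illustration, and the exact shape you get back depends on the column's type and the query engine.

```python
import json

# Hypothetical result row: the "events" column is assumed to hold JSON text
row = {
    "events": '[{"deviceType": "Android", "logins": 400},'
              ' {"deviceType": "iPhone", "logins": 550}]'
}

events = json.loads(row["events"])

# Types survive the round trip, so the numbers are already integers
total_logins = sum(e["logins"] for e in events)
print(total_logins)  # 950
```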

Given the data as shown, you can isolate the strings between curly brackets with a regular expression, then split those strings into their component key=value parts. Here's an example:

import re

d = {'events': "[{deviceType=Android,logins=400},{deviceType=iPhone,logins=550}]"}

for t in re.findall('(?<={).+?(?=})', d['events']):
    for p in t.split(','):
        print(p)

Output:

deviceType=Android
logins=400
deviceType=iPhone
logins=550
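Going one step further, each key=value pair can be split on its first "=" to build dictionaries, which json.dumps then turns into real JSON. This is only a sketch: the function name struct_string_to_records is hypothetical, and it assumes values contain no commas or curly brackets. Characters such as "-" and "." in values are fine.

```python
import json
import re

d = {'events': "[{deviceType=Android,logins=400},{deviceType=iPhone,logins=550}]"}

def struct_string_to_records(text):
    """Hypothetical helper: turn "[{k=v,k=v},{...}]" into a list of dicts."""
    records = []
    # Grab the text between each pair of curly brackets
    for body in re.findall(r'(?<=\{).+?(?=\})', text):
        record = {}
        for pair in body.split(','):
            # Split on the first "=" only, so values may contain "="
            key, _, value = pair.partition('=')
            record[key.strip()] = value.strip()
        records.append(record)
    return records

records = struct_string_to_records(d['events'])
print(json.dumps(records))
```

All values come back as strings, so anything numeric (such as logins) still needs converting afterwards.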
