
Parsing a JSON response in Python with regex

Situation: I have a weather GUI made in tkinter. It gets data from an API and displays it on a tkinter label. One of the functions, 'format_alerts', parses JSON data from the API. Because of the way the data is formatted, I'm having trouble parsing out what I need.

Problem: I came up with a really weird way of parsing the data. The JSON uses '...' and asterisks ('*') to separate values in a string (inside a dictionary). I use .replace('\n', ' ') to get rid of newlines, then .replace('*', '@') and .replace('...', '@') to mark the split points, then .split('@'), and then I reference the pieces by list index. However, the JSON sometimes uses '...' in other places, which shifts the indexes. I know regex is a better way to do this, but for the life of me I can't get a three-part regex search to work.

My present code looks like:

import textwrap

def format_alerts(weather_json):
    alert_report = ""
    alerts = weather_json['alerts']
    try:
        # Loop because there are sometimes several alerts in the list.
        for item in alerts:
            event = item['event']
            details = item['description']
            # Mark every '*' and '...' as a split point, then split on the marker.
            parsed = details.replace('\n', ' ').replace('*', '@').replace('...', '@').split('@')
            # textwrap is used to make sure the text fits in my tkinter label.
            what = textwrap.fill(parsed[4], 51)
            where = textwrap.fill(parsed[6], 51)
            when = textwrap.fill(parsed[8], 51)
            # Plug the wrapped pieces into a single string.
            single_alert = '''{}: {}\nWhere: {}\nWhen: {}\n'''.format(event, what, where, when)
            alert_report += single_alert
    except Exception:
        alert_report = "Alerts Error"
        print('ERROR: (format_alerts) retrieving, formatting >>> alert_report ')
    return alert_report

weather_json looks like:

{'alerts': [{
    'event': 'Small Craft Advisory',
    # The string is split across lines here so it isn't massively long.
    # In the raw data it is one string whose only line breaks are the \n escapes.
    'description': (
        '...SMALL CRAFT ADVISORY NOW IN EFFECT UNTIL 3 AM PST SATURDAY...\n'
        '* WHAT...Rough bar conditions expected.\n'
        '- GENERAL SEAS...Seas 18 to 20 ft today then easing to 14 ft\n'
        'late tonight and Sat.\n'
        '- FIRST EBB...Around 415 AM Fri. Seas near 20 feet with\nbreakers.\n'
        '- SECOND EBB...Strong ebb around 415 PM. Seas near 20 ft\n'
        'with breakers.\n'
        '* WHERE...In the Main Channel of the Columbia River Bar.\n'
        '* WHEN...Until 3 AM PST Saturday.\n'
        '* IMPACTS...Conditions will be hazardous to small craft\n'
        'especially when navigating in or near harbor entrances.'
    ),
}]}
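To make the failure concrete, here is a minimal sketch that runs the same replace/split chain on the sample description above and prints the three pieces my code picks out. The variable name `description` is introduced here only for illustration.

```python
# Same sample description as above, written as one string literal.
description = (
    "...SMALL CRAFT ADVISORY NOW IN EFFECT UNTIL 3 AM PST SATURDAY...\n"
    "* WHAT...Rough bar conditions expected.\n"
    "- GENERAL SEAS...Seas 18 to 20 ft today then easing to 14 ft\n"
    "late tonight and Sat.\n"
    "- FIRST EBB...Around 415 AM Fri. Seas near 20 feet with\nbreakers.\n"
    "- SECOND EBB...Strong ebb around 415 PM. Seas near 20 ft\n"
    "with breakers.\n"
    "* WHERE...In the Main Channel of the Columbia River Bar.\n"
    "* WHEN...Until 3 AM PST Saturday.\n"
    "* IMPACTS...Conditions will be hazardous to small craft\n"
    "especially when navigating in or near harbor entrances."
)

parsed = (description.replace('\n', ' ').replace('*', '@')
          .replace('...', '@').split('@'))

# Each '- SUB-BULLET...' line adds one extra split point, so the
# fixed indexes 6 and 8 no longer point at the WHERE and WHEN text.
for index in (4, 6, 8):
    print(index, repr(parsed[index]))
```

With three sub-bullets under WHAT, index 6 lands inside the FIRST EBB text and index 8 lands on the ' WHERE' label itself rather than on its text.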

I want the returned 'alert_report' string to look like this:

'''Small Craft Advisory: Rough bar conditions expected. GENERAL SEAS Seas 18 to 20 ft today 
then easing to 14 ft late tonight and Sat. FIRST EBB Around 415 AM Fri. Seas near 20 feet 
with breakers. SECOND EBB Strong ebb around 415 PM. Seas near 20 ft with breakers.
Where: In the Main Channel of the Columbia River Bar.
When: Until 3 AM PST Saturday.
'''

Note: My present code worked on thirty-some alerts; this was the first one that broke it. I can live without the "GENERAL SEAS Seas 18 to 20 ft..." part in the first line, but I don't want to cut the text off at the first '.' because some alerts are several sentences long. I have studied regex, but I'm not very good with it.

Maybe something like this?

import re
import textwrap

alert = (
    "...SMALL CRAFT ADVISORY NOW IN EFFECT UNTIL 3 AM PST SATURDAY...\n* WHAT"
    "...Rough bar conditions expected.\n- GENERAL SEAS...Seas 18 to 20 ft "
    "today then easing to 14 ft\nlate tonight and Sat.\n- FIRST EBB...Around "
    "415 AM Fri. Seas near 20 feet with\nbreakers.\n- SECOND EBB...Strong ebb "
    "around 415 PM. Seas near 20 ft\nwith breakers.\n* WHERE...In the Main "
    "Channel of the Columbia River Bar.\n* WHEN...Until 3 AM PST Saturday.\n* "
    "IMPACTS...Conditions will be hazardous to small craft\nespecially when "
    "navigating in or near harbor entrances."
)

# we use this to strip out newlines and '...' markers.
re_garbage = re.compile(r'(\.\.\.|\n)')

# this recognizes the major sections of the alert such
# as '* WHEN' and '* WHERE'.
re_keys = re.compile(r'\* ([A-Z ]+) ([^*]+)')

# This recognizes list items like '- FIRST EBB', etc.
re_item = re.compile(r'- ([A-Z ]+) ([^*-]+)')

# replace newlines and '...' with a space, and strip any
# leading/trailing whitespace.
alert = re_garbage.sub(' ', alert).strip()

# Get rid of the '- ' on list items
alert = re_item.sub(r'\1 \2', alert)

# Extract the major parts into a dictionary, cutting each one out of
# the alert text as we go (the ':=' operator needs Python 3.8+).
parts = {}
while match := re_keys.search(alert):
    parts[match.group(1)] = match.group(2)
    alert = alert[:match.start()] + alert[match.end():]

# Avengers assemble!
final = '\n'.join([
    textwrap.fill(alert + parts['WHAT']),
    f'When: {parts["WHEN"]}',
    f'Where: {parts["WHERE"]}',
])

print(final)

Which produces:

SMALL CRAFT ADVISORY NOW IN EFFECT UNTIL 3 AM PST SATURDAY  Rough bar
conditions expected. GENERAL SEAS Seas 18 to 20 ft today then easing
to 14 ft late tonight and Sat. FIRST EBB Around 415 AM Fri. Seas near
20 feet with breakers. SECOND EBB Strong ebb around 415 PM. Seas near
20 ft with breakers.
When: Until 3 AM PST Saturday. 
Where: In the Main Channel of the Columbia River Bar. 
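One caveat, offered as an observation rather than something checked against real alerts: because `[A-Z ]+` is greedy, a section whose text starts with a one-letter capitalized word (such as "A") would have that word absorbed into the key once '...' has been turned into a space. The sketch below uses a made-up two-line alert (the wording is hypothetical) to show the effect, plus one possible variation that keeps the literal '...' as the key delimiter. The names `loose` and `strict` are illustrative only.

```python
import re

# Hypothetical alert text, invented for this illustration.
text = "* WHAT...A strong storm is expected.\n* WHEN...Until noon."

# Approach from the code above: replace '...' first, then match on spaces.
cleaned = re.sub(r'(\.\.\.|\n)', ' ', text).strip()
loose = dict(re.findall(r'\* ([A-Z ]+) ([^*]+)', cleaned))

# Variation: replace only newlines and let the literal '...' end the key.
flat = text.replace('\n', ' ')
strict = {key: value.strip()
          for key, value in re.findall(r'\* ([A-Z ]+)\.\.\.([^*]+)', flat)}

print(sorted(loose))   # the first key has absorbed the leading "A"
print(sorted(strict))
```

With the variation, any '...' or '- ' left inside the captured values would still need to be cleaned up afterwards.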

If you print a bunch of alert descriptions, it looks like each one has an optional summary or discussion followed by a bulleted list. Bullets start with '* ' and sub-bullets with '- '. Each bullet has a capitalized key, then '...', then the text. As you are only interested in some of the bullets, a regex like this should work (after replacing '...' with a space):

pattern = re.compile(r"[*] (WHAT|WHERE|WHEN) ([^*]+)")

Using pattern.findall() will result in a list of two-tuples. The first element is the 'WHAT', 'WHERE', or 'WHEN'. The second element is the description up to the next '*' or end of the string. Using dict() on that list will create a dictionary with 'WHAT', 'WHERE', and 'WHEN' as the keys and the captured text as the values.
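As a small illustration of that step, the sketch below runs the pattern on a shortened description in which '...' has already been replaced by a space. The sample string is abbreviated for illustration and is not real API output.

```python
import re

pattern = re.compile(r"[*] (WHAT|WHERE|WHEN) ([^*]+)")

# Shortened description with '...' already replaced by a space (illustrative).
details = ("SMALL CRAFT ADVISORY  * WHAT Rough bar conditions expected. "
           "* WHERE In the Main Channel. * WHEN Until 3 AM PST Saturday. "
           "* IMPACTS Hazardous to small craft.")

pairs = pattern.findall(details)   # list of (key, text) two-tuples
info = dict(pairs)                 # keys: 'WHAT', 'WHERE', 'WHEN'

for key, text in pairs:
    print(key, '->', text.strip())
```

The leading summary and the IMPACTS bullet are skipped because neither starts with one of the three keys in the alternation.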

Put it in a function to build a report for a single alert:

import re
import textwrap

def format_alert(event, details):
    try:
        # Flatten to one line, drop sub-bullet markers, turn '...' into spaces.
        details = details.replace('\n', ' ').replace('- ', '').replace('...', ' ')

        # Map each wanted key to the text that follows it, up to the next '*'.
        info = dict(re.findall(r"[*] (WHAT|WHERE|WHEN) ([^*]+)", details))

        what = textwrap.fill(f"{event.title()}: {info['WHAT']}", 51)
        where = textwrap.fill(f"Where: {info['WHERE']}", 51)
        when = textwrap.fill(f"When: {info['WHEN']}", 51)
        report = f"{what}\n{where}\n{when}\n"

    except (KeyError, ValueError):
        report = "Alert Error"
        print('ERROR: (format_alert) retrieving, formatting >>> alert_report ')

    return report

For the sample input, it returns:

Small Craft Advisory: Rough bar conditions
expected. GENERAL SEAS Seas 18 to 20 ft today then
easing to 14 ft late tonight and Sat. FIRST EBB
Around 415 AM Fri. Seas near 20 feet with breakers.
SECOND EBB Strong ebb around 415 PM. Seas near 20
ft with breakers.
Where: In the Main Channel of the Columbia River
Bar.
When: Until 3 AM PST Saturday.
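Since the question's function handles a list of alerts, a wrapper along the following lines could call the per-alert function for each item. This is only a sketch: `format_alerts` here is a hypothetical wrapper, the use of `.get('alerts', [])` assumes the key may be missing when there are no alerts, and `format_alert` is condensed from the function above so the block runs on its own.

```python
import re
import textwrap

def format_alert(event, details):
    # Condensed copy of the function above so this block is self-contained.
    details = details.replace('\n', ' ').replace('- ', '').replace('...', ' ')
    info = dict(re.findall(r"[*] (WHAT|WHERE|WHEN) ([^*]+)", details))
    try:
        lines = [f"{event.title()}: {info['WHAT']}",
                 f"Where: {info['WHERE']}",
                 f"When: {info['WHEN']}"]
    except KeyError:
        return "Alert Error"
    return "\n".join(textwrap.fill(line, 51) for line in lines) + "\n"

def format_alerts(weather_json):
    # Hypothetical wrapper: .get() covers responses with no 'alerts' key
    # (an assumption about the API, not something shown in the question).
    return "".join(format_alert(item['event'], item['description'])
                   for item in weather_json.get('alerts', []))

# Shortened sample in the same shape as the question's data.
sample = {'alerts': [{
    'event': 'Small Craft Advisory',
    'description': ('...SMALL CRAFT ADVISORY NOW IN EFFECT...\n'
                    '* WHAT...Rough bar conditions expected.\n'
                    '* WHERE...In the Main Channel of the Columbia River Bar.\n'
                    '* WHEN...Until 3 AM PST Saturday.'),
}]}
print(format_alerts(sample))
```

Keeping the per-alert logic in its own function means one malformed alert yields "Alert Error" for that alert only, instead of replacing the whole report.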
