Parsing a JSON response in Python with regex

Situation: I have a weather GUI made in tkinter. It gets data from an API and displays it on a tkinter label. One of the functions, 'format_alerts', parses JSON data from the API. Because of the way the data is formatted, I'm having trouble parsing out what I need.

Problem: I came up with a really weird way of parsing the data. The JSON uses '...' and asterisks ('*') to separate values in a string (inside a dictionary). I use .replace('\n', ' ') to get rid of newlines. I use .replace('*', '@') and .replace('...', '@') to mark the split points, then .split('@'), and then reference the list index numbers. However, sometimes the JSON uses '...' randomly, so I end up messing up the indexing. I know regex is a better way to do this, but for the life of me I can't get a three-part regex search to work.

My present code looks like:

import textwrap

def format_alerts(weather_json):
    alert_report = ""
    alerts = weather_json['alerts']
    try:
        # The for loop is here because there are sometimes several alerts in the list.
        for item in alerts:
            event = item['event']
            details = item['description']
            parsed = details.replace('\n', ' ').replace('*', '@').replace('...', '@').split('@')
            # textwrap is used to make sure it fits in my tkinter label
            what = textwrap.fill(parsed[4], 51)
            where = textwrap.fill(parsed[6], 51)
            when = textwrap.fill(parsed[8], 51)
            # Plug the text-wrapped pieces into a single string.
            single_alert = '''{}: {}\nWhere: {}\nWhen: {}\n'''.format(event, what, where, when)
            alert_report += single_alert
    except Exception:
        alert_report = "Alerts Error"
        print('ERROR: (format_alerts) retrieving, formatting >>> alert_report ')
    return alert_report

weather_json looks like:

{'alerts': [{
    'event': 'Small Craft Advisory', 
    'description': 
'...SMALL CRAFT ADVISORY NOW IN EFFECT UNTIL 3 AM PST SATURDAY...\n* WHAT...Rough bar conditions
 expected.\n- GENERAL SEAS...Seas 18 to 20 ft today then easing to 14 ft\nlate tonight and 
Sat.\n- FIRST EBB...Around 415 AM Fri. Seas near 20 feet with\nbreakers.\n- SECOND EBB...Strong
 ebb around 415 PM. Seas near 20 ft\nwith breakers.\n* WHERE...In the Main Channel of the 
Columbia River Bar.\n* WHEN...Until 3 AM PST Saturday.\n* IMPACTS...Conditions will be hazardous
 to small craft\nespecially when navigating in or near harbor entrances.'
# I added line breaks so it wasn't massively long; the raw data only has newlines denoted by \n
},]}

I want the returned 'alert_report' string to look like this:

'''Small Craft Advisory: Rough bar conditions expected. GENERAL SEAS Seas 18 to 20 ft today 
then easing to 14 ft late tonight and Sat. FIRST EBB Around 415 AM Fri. Seas near 20 feet 
with breakers. SECOND EBB Strong ebb around 415 PM. Seas near 20 ft with breakers.
Where: In the Main Channel of the Columbia River Bar.
When: Until 3 AM PST Saturday.'''

Note: My present code worked on 30-some alerts; this was the first one that broke it. I can live without the "GENERAL SEAS Seas 18 to 20 ft..." in the first line, but I don't want to cut it off at '.' because some alerts are several sentences long. I learned regex but I'm not very good with it.

Maybe something like this?

import re
import textwrap

alert = (
    "...SMALL CRAFT ADVISORY NOW IN EFFECT UNTIL 3 AM PST SATURDAY...\n* WHAT"
    "...Rough bar conditions expected.\n- GENERAL SEAS...Seas 18 to 20 ft "
    "today then easing to 14 ft\nlate tonight and Sat.\n- FIRST EBB...Around "
    "415 AM Fri. Seas near 20 feet with\nbreakers.\n- SECOND EBB...Strong ebb "
    "around 415 PM. Seas near 20 ft\nwith breakers.\n* WHERE...In the Main "
    "Channel of the Columbia River Bar.\n* WHEN...Until 3 AM PST Saturday.\n* "
    "IMPACTS...Conditions will be hazardous to small craft\nespecially when "
    "navigating in or near harbor entrances."
)

# we use this to strip out newlines and '...' markers.
re_garbage = re.compile(r'(\.\.\.|\n)')

# this recognizes the major sections of the alert such
# as '* WHEN' and '* WHERE'.
re_keys = re.compile(r'\* ([A-Z ]+) ([^*]+)')

# This recognizes list items like `- FIRST EBB', etc.
re_item = re.compile(r'- ([A-Z ]+) ([^*-]+)')

# replace newlines and '...' with a space, and strip any
# leading/trailing whitespace.
alert = re_garbage.sub(' ', alert).strip()

# Get rid of the '- ' on list items
alert = re_item.sub(r'\1 \2', alert)

# Extract the major parts into a dictionary
parts = {}
while match := re_keys.search(alert):
    parts[match.group(1)] = match.group(2)
    alert = alert[:match.start()] + alert[match.end():]

# Avengers assemble!
final = '\n'.join([
    textwrap.fill(alert + parts['WHAT']),
    f'When: {parts["WHEN"]}',
    f'Where: {parts["WHERE"]}',
])

print(final)

Which produces:

SMALL CRAFT ADVISORY NOW IN EFFECT UNTIL 3 AM PST SATURDAY  Rough bar
conditions expected. GENERAL SEAS Seas 18 to 20 ft today then easing
to 14 ft late tonight and Sat. FIRST EBB Around 415 AM Fri. Seas near
20 feet with breakers. SECOND EBB Strong ebb around 415 PM. Seas near
20 ft with breakers.
When: Until 3 AM PST Saturday. 
Where: In the Main Channel of the Columbia River Bar. 

If you print a bunch of alert descriptions, it looks like there is an optional summary or discussion followed by a bullet list of items. Bullets start with '*', sub-bullets with '-'. Each bullet has a capitalized key, then '...', then the text. As you are only interested in some of the bullets, a regex like this should work (after replacing '...' with a space):

pattern = re.compile(r"[*] (WHAT|WHERE|WHEN) ([^*]+)")

Using pattern.findall() will result in a list of two-tuples. The first element of each tuple is 'WHAT', 'WHERE', or 'WHEN'. The second element is the description up to the next '*' or the end of the string. Calling dict() on that list creates a dictionary with 'WHAT', 'WHERE', and 'WHEN' as the keys and the captured text as the values.
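To make the intermediate values concrete, here is a minimal sketch run on a shortened version of the sample description. The `details` string is abbreviated for illustration and assumes newlines and '...' have already been replaced with spaces; the names `pairs` and `info` are just illustrative.

```python
import re

pattern = re.compile(r"[*] (WHAT|WHERE|WHEN) ([^*]+)")

# Abbreviated, illustrative description: newlines and '...' are assumed
# to have been replaced with spaces already.
details = (
    "SMALL CRAFT ADVISORY NOW IN EFFECT "
    "* WHAT Rough bar conditions expected. "
    "* WHERE In the Main Channel of the Columbia River Bar. "
    "* WHEN Until 3 AM PST Saturday. "
    "* IMPACTS Conditions will be hazardous to small craft."
)

# findall() returns one (key, text) tuple per matching bullet.
pairs = pattern.findall(details)

# dict() turns the list of two-tuples into a key -> text mapping.
info = dict(pairs)

print(pairs)
print(info)
```

The leading summary and the IMPACTS bullet are skipped because neither matches the key alternation. Each captured text keeps the trailing space that precedes the next '*', which textwrap.fill() later drops.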

Put it in a function to build a report for a single alert:

import re
import textwrap

def format_alert(event, details):
    try:
        # Flatten newlines, drop sub-bullet dashes, and turn '...' into a space.
        details = details.replace('\n', ' ').replace('- ', '').replace('...', ' ')

        # Map each wanted bullet key to its text.
        info = dict(re.findall(r"[*] (WHAT|WHERE|WHEN) ([^*]+)", details))

        # Wrap each part to 51 columns so it fits the tkinter label.
        what = textwrap.fill(f"{event.title()}: {info['WHAT']}", 51)
        where = textwrap.fill(f"Where: {info['WHERE']}", 51)
        when = textwrap.fill(f"When: {info['WHEN']}", 51)
        report = f"{what}\n{where}\n{when}\n"

    except (KeyError, ValueError):
        report = "Alert Error"
        print('ERROR: (format_alert) retrieving, formatting >>> alert_report ')

    return report

For the sample input, it returns:

Small Craft Advisory: Rough bar conditions
expected. GENERAL SEAS Seas 18 to 20 ft today then
easing to 14 ft late tonight and Sat. FIRST EBB
Around 415 AM Fri. Seas near 20 feet with breakers.
SECOND EBB Strong ebb around 415 PM. Seas near 20
ft with breakers.
Where: In the Main Channel of the Columbia River
Bar.
When: Until 3 AM PST Saturday.
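To get back to the `format_alerts(weather_json)` interface from the question, the per-alert function can be called in a loop. The following is only a sketch, assuming the `weather_json['alerts']` layout shown in the question. `format_alert` is repeated in condensed form so the block runs on its own, the `sample` data is abbreviated, and treating a missing 'alerts' key as "no alerts" is an assumption, not something the API is known to guarantee.

```python
import re
import textwrap

PATTERN = re.compile(r"[*] (WHAT|WHERE|WHEN) ([^*]+)")

def format_alert(event, details, width=51):
    """Condensed variant of the single-alert function above."""
    details = details.replace('\n', ' ').replace('- ', '').replace('...', ' ')
    info = dict(PATTERN.findall(details))
    try:
        lines = [
            f"{event.title()}: {info['WHAT']}",
            f"Where: {info['WHERE']}",
            f"When: {info['WHEN']}",
        ]
    except KeyError:
        # One of the expected bullets is missing from this alert.
        return "Alert Error\n"
    return "\n".join(textwrap.fill(line, width) for line in lines) + "\n"

def format_alerts(weather_json):
    # Assumption: a response without alerts simply lacks the 'alerts' key.
    alerts = weather_json.get('alerts', [])
    return "".join(format_alert(a['event'], a['description']) for a in alerts)

# Abbreviated, illustrative sample in the shape shown in the question.
sample = {'alerts': [{
    'event': 'Small Craft Advisory',
    'description': (
        "...SMALL CRAFT ADVISORY NOW IN EFFECT...\n"
        "* WHAT...Rough bar conditions expected.\n"
        "* WHERE...In the Main Channel of the Columbia River Bar.\n"
        "* WHEN...Until 3 AM PST Saturday.\n"
        "* IMPACTS...Conditions will be hazardous to small craft."
    ),
}]}

print(format_alerts(sample))
```

Handling each alert separately means one malformed description produces an "Alert Error" entry for that alert only, instead of replacing the whole report.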
