
Parse online text file for only most recent data

8/23 Edit:

Thank you guys for the replies and for code that's probably a bit more efficient than mine. However, I didn't do the best job of describing exactly what I'm trying to capture.

@DarkKnight is correct that the important token I'm querying is in column 5. But for each of those important tokens there can be up to 15 lines of text that I need to parse in order to capture the full model run. Using "TVCN" as an example, I need to capture all of this:

AL, 07, 2021082118, 03, TVCN, 0, 197N, 995W, 0
AL, 07, 2021082118, 03, TVCN, 12, 194N, 1026W, 0
AL, 07, 2021082118, 03, TVCN, 24, 191N, 1055W, 0
AL, 07, 2021082118, 03, TVCN, 36, 198N, 1084W, 0
AL, 07, 2021082118, 03, TVCN, 48, 202N, 1113W, 0
AL, 07, 2021082118, 03, TVCN, 60, 204N, 1139W, 0
AL, 07, 2021082118, 03, TVCN, 72, 208N, 1164W, 0
AL, 07, 2021082118, 03, TVCN, 84, 210N, 1188W, 0
AL, 07, 2021082118, 03, TVCN, 96, 211N, 1209W, 0
AL, 07, 2021082118, 03, TVCN, 108, 206N, 1230W, 0
AL, 07, 2021082118, 03, TVCN, 120, 201N, 1251W, 0

Column 3 is the date/time of the model run (yyyymmddhh), while column 6 is the forecast hour. So in order to plot the forecast through time but only capture the most recent model run, I need to return all instances of TVCN dated '2021082118'. And, of course, the date value updates each time the model is run again. Does that make sense?
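To make that concrete, here is a minimal sketch of the selection I'm describing, run against the TVCN lines above. Note the 2021082112 line is invented here purely to stand in for an older model run; in the real script the rows would come from the downloaded file, not an inline string.

```python
# Sample data: one invented older-run line (2021082112) plus two lines
# from the latest run (2021082118), trimmed for brevity.
sample = """\
AL, 07, 2021082112, 03, TVCN, 0, 195N, 980W, 0
AL, 07, 2021082118, 03, TVCN, 0, 197N, 995W, 0
AL, 07, 2021082118, 03, TVCN, 12, 194N, 1026W, 0"""

rows = [line.split(', ') for line in sample.splitlines()]

# Column 3 (index 2) is the run date/time, column 5 (index 4) the model code.
# yyyymmddhh is fixed-width, so the max as a string is the most recent run.
latest = max(row[2] for row in rows if row[4] == 'TVCN')
recent_run = [row for row in rows if row[4] == 'TVCN' and row[2] == latest]
```

After this, `recent_run` holds only the two 2021082118 rows; the older run is dropped.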


I have code that is partially working for my needs, but I'm stuck on getting it exactly where I want it. I'm pulling comma-separated data from an online text file; my code then tosses out the lines I don't want. These are raw data for hurricane forecast models. However, the online text file stores all model runs for a given storm, and I would like to pull only the most recent run for the models I selected. Each model has multiple lines of text for a given model run (forecast t+12, t+24, etc.). Can this be accomplished?

Here's what I have that is partially working:

import urllib.request

webf = urllib.request.urlopen("http://hurricanes.ral.ucar.edu/realtime/plots/northatlantic/2021/al072021/aal072021.dat")
lines = webf.readlines() 

important_codes = ["AEM2", "AEMI", "AVNI", "CEM2", "COT2", "CTC1", "DSHP", "EGR2", "HMON", "HWFI", "NNIB", "LGEM", "NNIC", "OFCI", "OFCL", "SHIP", "TVCN", "UKX2"]

def is_important_line(line):
    return any(code in line for code in important_codes)

output_lines = []
for line in lines:
    decoded_line = line.decode("utf-8") 

    if not is_important_line(decoded_line):
        continue
    output_lines.append(decoded_line)

with open('test.txt', 'w') as f:
    f.write("".join(output_lines))

Alright, I filtered on the wrong column. This should work:

from itertools import groupby

output_lines = []
for line in lines:
    line = line.decode("utf-8").strip()
    line = line.split(', ')

    # Skip short/malformed lines and any model not in the wanted list
    if len(line) < 5 or line[4] not in important_codes:
        continue
    output_lines.append(line)

# Sort by model code (column 5) so groupby sees each model's lines together;
# the sort is stable, so lines keep their file order within each model.
output_lines = sorted(output_lines, key=lambda x: x[4])
new_output = []
for code, group in groupby(output_lines, key=lambda x: x[4]):
    best_date = 0
    temp_entries = []
    # Column 3 is yyyymmddhh; keep only the entries from the latest run
    for date, entries in groupby(group, key=lambda x: x[2]):
        date = int(date)
        if date > best_date:
            best_date = date
            temp_entries = list(entries)
    for entry in temp_entries:
        new_output.append(', '.join(entry))

with open('mydata.dat', 'w') as f:
    f.write('\n'.join(new_output))

It's probably better to write the output file as you iterate over the input data. The "important" token seems to be in column 5 (base 1). Your code could give ambiguous results if, for example, 'AVNI' appeared somewhere else in a line, because a plain substring test isn't anchored to any column. Try this:

import requests

IC = ["AEM2", "AEMI", "AVNI", "CEM2", "COT2", "CTC1", "DSHP", "EGR2", "HMON",
      "HWFI", "NNIB", "LGEM", "NNIC", "OFCI", "OFCL", "SHIP", "TVCN", "UKX2"]

with open('test.txt', 'w') as outfile:
    with requests.Session() as session:
        response = session.get(
            'http://hurricanes.ral.ucar.edu/realtime/plots/northatlantic/2021/al072021/aal072021.dat')
        response.raise_for_status()
        for line in response.text.splitlines():
            try:
                if line.split(',')[4].strip() in IC:
                    outfile.write(f'{line}\n')
            except IndexError:
                pass
print('Done')

EDIT: If you're only interested in the most recent occurrence of an "important" token, then you could do this:-

import requests

IC = ["AEM2", "AEMI", "AVNI", "CEM2", "COT2", "CTC1", "DSHP", "EGR2", "HMON",
      "HWFI", "NNIB", "LGEM", "NNIC", "OFCI", "OFCL", "SHIP", "TVCN", "UKX2"]

with open('test.txt', 'w') as outfile:
    TD = {}
    with requests.Session() as session:
        response = session.get(
            'http://hurricanes.ral.ucar.edu/realtime/plots/northatlantic/2021/al072021/aal072021.dat')
        response.raise_for_status()
        for line in response.text.splitlines():
            try:
                if (k := line.split(',')[4].strip()) in IC:
                    TD[k] = line
            except IndexError:
                pass
    for v in TD.values():
        outfile.write(f'{v}\n')
print('Done')
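Note the dictionary above keeps only the *last* matching line per token. If, per the 8/23 edit, you need every line of a model's most recent run, you can store the run date alongside the kept lines and compare on it. This is a sketch of that idea as a standalone function (the helper name `latest_runs` and the sample lines, including an invented 2021082112 older run, are mine for illustration); in practice you would feed it `response.text.splitlines()` and your `IC` list.

```python
def latest_runs(lines, codes):
    """Return, per model code, only the lines from its most recent run."""
    latest = {}  # model code -> (run date, lines kept so far)
    for line in lines:
        fields = [f.strip() for f in line.split(',')]
        if len(fields) < 5 or fields[4] not in codes:
            continue
        code, run = fields[4], fields[2]
        date, kept = latest.get(code, ('', []))
        if run > date:            # yyyymmddhh is fixed-width: string compare works
            latest[code] = (run, [line])   # newer run found: start over
        elif run == date:
            kept.append(line)              # another line of the same run
    return [l for _, kept in latest.values() for l in kept]

# Invented sample: one older-run line (2021082112) and two latest-run lines
sample = [
    'AL, 07, 2021082112, 03, TVCN,   0,  195N,  980W,    0',
    'AL, 07, 2021082118, 03, TVCN,   0,  197N,  995W,    0',
    'AL, 07, 2021082118, 03, TVCN,  12,  194N, 1026W,    0',
]
print(latest_runs(sample, {'TVCN'}))   # only the two 2021082118 lines survive
```

Splitting on ',' and stripping each field also copes with the padded columns in the real .dat file, which `split(', ')` alone would not.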
