
Parse online text file for only most recent data

8/23 Edit:

Thank you guys for the replies and for code that's probably a bit more efficient than mine. However, I didn't do the best job of describing exactly what I'm trying to capture.

@DarkKnight is correct that the important token I'm querying is in column 5. But for each of those important tokens there can be up to 15 lines of text that I need to parse in order to capture the full model run. Using "TVCN" as an example, I need to capture all of this:

AL, 07, 2021082118, 03, TVCN, 0, 197N, 995W, 0
AL, 07, 2021082118, 03, TVCN, 12, 194N, 1026W, 0
AL, 07, 2021082118, 03, TVCN, 24, 191N, 1055W, 0
AL, 07, 2021082118, 03, TVCN, 36, 198N, 1084W, 0
AL, 07, 2021082118, 03, TVCN, 48, 202N, 1113W, 0
AL, 07, 2021082118, 03, TVCN, 60, 204N, 1139W, 0
AL, 07, 2021082118, 03, TVCN, 72, 208N, 1164W, 0
AL, 07, 2021082118, 03, TVCN, 84, 210N, 1188W, 0
AL, 07, 2021082118, 03, TVCN, 96, 211N, 1209W, 0
AL, 07, 2021082118, 03, TVCN, 108, 206N, 1230W, 0
AL, 07, 2021082118, 03, TVCN, 120, 201N, 1251W, 0

Column 3 is the date/time of the model run (yyyymmddhh), while column 6 is the forecast hour. So in order to plot the forecast through time but only capture the most recent model run, I need to return all instances of TVCN dated '2021082118'. And, of course, the date value updates each time the model is run again. Does that make sense?
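To make the column layout concrete, here is a minimal sketch that pulls the run time, model code, and forecast hour out of one of the rows above. `parse_row` is a hypothetical helper name, and treating the valid time as run time plus forecast hour is an assumption about how the forecast hour is meant to be read.

```python
from datetime import datetime, timedelta

def parse_row(line):
    """Split one comma-separated row and return the fields described above."""
    fields = [f.strip() for f in line.split(",")]
    run_time = datetime.strptime(fields[2], "%Y%m%d%H")  # column 3: yyyymmddhh
    model = fields[4]                                     # column 5: model code
    forecast_hour = int(fields[5])                        # column 6: forecast hour
    # Assumption: the forecast is valid forecast_hour hours after the run time
    valid_time = run_time + timedelta(hours=forecast_hour)
    return model, run_time, forecast_hour, valid_time

row = "AL, 07, 2021082118, 03, TVCN, 12, 194N, 1026W, 0"
print(parse_row(row))
```

Because every run date has the same yyyymmddhh width, comparing the raw strings orders runs the same way as comparing the parsed datetimes.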


I have code that is partially working for my needs, but I'm stuck on trying to get it to exactly where I want. I'm pulling comma separated data in from an online text file. My code then tosses out the lines I don't want. These are raw data for hurricane forecast models. However, the online text file stores all model runs for a given storm. I would like to only pull the most recent run for the models I selected. Each model has multiple lines of text for a given model run (forecast t+12, t+24, etc). Can this be accomplished?

Here's what I have that is partially working:

import urllib.request

webf = urllib.request.urlopen("http://hurricanes.ral.ucar.edu/realtime/plots/northatlantic/2021/al072021/aal072021.dat")
lines = webf.readlines() 

important_codes = ["AEM2", "AEMI", "AVNI", "CEM2", "COT2", "CTC1", "DSHP", "EGR2", "HMON", "HWFI", "NNIB", "LGEM", "NNIC", "OFCI", "OFCL", "SHIP", "TVCN", "UKX2"]

def is_important_line(line):
    return any(code in line for code in important_codes)

output_lines = []
for line in lines:
    decoded_line = line.decode("utf-8") 

    if not is_important_line(decoded_line):
        continue
    output_lines.append(decoded_line)

f = open('test.txt', 'w') 

f.write("".join(output_lines)) 
f.close()

Alright, I filtered on the wrong column. This should work:

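from itertools import groupby

# `lines` and `important_codes` are the names defined in the question's code
output_lines = []
for line in lines:
    line = line.decode("utf-8")
    line = line.split(', ')[:-1]  # drop the trailing piece after the last ', '

    # column 5 (index 4) holds the model code; skip rows that are too short
    if len(line) < 5 or line[4] not in important_codes:
        continue
    output_lines.append(line)

# groupby only merges adjacent rows, so sort by model code, then by run date
output_lines = sorted(output_lines, key=lambda x: (x[4], x[2]))
new_output = []
for code, group in groupby(output_lines, key=lambda x: x[4]):
    best_date = 0
    temp_entries = []
    # keep every row of the newest run (column 3) for this model
    for date, entries in groupby(group, key=lambda x: x[2]):
        date = int(date)
        if date > best_date:
            best_date = date
            temp_entries = list(entries)
    for entry in temp_entries:
        new_output.append(', '.join(entry))

with open('mydata.dat', 'w') as f:
    f.write('\n'.join(new_output))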

Probably better if you write the output file as you iterate over the input data. The "important" token seems to be in column 5 (counting from 1). Your code could lead to ambiguous results if, for example, 'AVNI' appeared somewhere else in a line. Try this:

import requests

IC = ["AEM2", "AEMI", "AVNI", "CEM2", "COT2", "CTC1", "DSHP", "EGR2", "HMON",
      "HWFI", "NNIB", "LGEM", "NNIC", "OFCI", "OFCL", "SHIP", "TVCN", "UKX2"]

with open('test.txt', 'w') as outfile:
    with requests.Session() as session:
        response = session.get(
            'http://hurricanes.ral.ucar.edu/realtime/plots/northatlantic/2021/al072021/aal072021.dat')
        response.raise_for_status()
        for line in response.text.splitlines():
            try:
                if line.split(',')[4].strip() in IC:
                    outfile.write(f'{line}\n')
            except IndexError:
                pass
print('Done')
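The ambiguity mentioned above can be illustrated with a small sketch. The row stored in `fake` is made up for the illustration (it is not taken from the data file), and both function names are hypothetical.

```python
important_codes = ["AVNI", "TVCN"]

def matches_anywhere(line):
    # Substring test, equivalent to is_important_line in the question
    return any(code in line for code in important_codes)

def matches_column_5(line):
    # Only look at the fifth comma-separated field (index 4)
    fields = line.split(",")
    return len(fields) > 4 and fields[4].strip() in important_codes

# Made-up row: the model code is XXXX, but "AVNI" appears in a later field
fake = "AL, 07, 2021082118, 03, XXXX, 0, 197N, 995W, 0, AVNI"
real = "AL, 07, 2021082118, 03, TVCN, 0, 197N, 995W, 0"

print(matches_anywhere(fake), matches_column_5(fake))
print(matches_anywhere(real), matches_column_5(real))
```

The substring test accepts both rows, while the column test accepts only the row whose fifth field is one of the wanted codes.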

EDIT: If you're only interested in the most recent occurrence of an "important" token, then you could do this:

import requests

IC = ["AEM2", "AEMI", "AVNI", "CEM2", "COT2", "CTC1", "DSHP", "EGR2", "HMON",
      "HWFI", "NNIB", "LGEM", "NNIC", "OFCI", "OFCL", "SHIP", "TVCN", "UKX2"]

with open('test.txt', 'w') as outfile:
    TD = {}
    with requests.Session() as session:
        response = session.get(
            'http://hurricanes.ral.ucar.edu/realtime/plots/northatlantic/2021/al072021/aal072021.dat')
        response.raise_for_status()
        for line in response.text.splitlines():
            try:
                if (k := line.split(',')[4].strip()) in IC:
                    TD[k] = line
            except IndexError:
                pass
    for v in TD.values():
        outfile.write(f'{v}\n')
print('Done')
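The 8/23 edit in the question asks for every forecast hour of the newest run rather than a single line per token. A minimal sketch of that variation follows, assuming the run date in column 3 is always a 10-digit yyyymmddhh string so that plain string comparison orders runs chronologically. `latest_runs` is a hypothetical name, and the 2021082112 rows in `sample` are invented for the illustration.

```python
from collections import defaultdict

def latest_runs(lines, codes):
    """Return every row from the most recent run of each model in codes."""
    runs = defaultdict(list)  # (model, run date) -> rows, in file order
    newest = {}               # model -> newest run date seen so far
    for line in lines:
        fields = [f.strip() for f in line.split(",")]
        if len(fields) < 6 or fields[4] not in codes:
            continue
        model, run_date = fields[4], fields[2]
        runs[(model, run_date)].append(line)
        # Equal-width yyyymmddhh strings compare in chronological order
        if run_date > newest.get(model, ""):
            newest[model] = run_date
    return [row for model, run_date in newest.items()
            for row in runs[(model, run_date)]]

sample = [
    "AL, 07, 2021082112, 03, TVCN, 0, 190N, 980W, 0",    # invented older run
    "AL, 07, 2021082112, 03, TVCN, 12, 192N, 1010W, 0",  # invented older run
    "AL, 07, 2021082118, 03, TVCN, 0, 197N, 995W, 0",
    "AL, 07, 2021082118, 03, TVCN, 12, 194N, 1026W, 0",
    "AL, 07, 2021082118, 03, XXXX, 0, 197N, 995W, 0",    # model not requested
]
for row in latest_runs(sample, {"TVCN"}):
    print(row)
```

With the real file, `response.text.splitlines()` could be passed as `lines` and `IC` as `codes`. The newest run is chosen per model, so two models may end up with different run dates if one of them was not included in the latest cycle.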
