Python read text file for data then extract sub-strings from list of strings

I have a weather data file with high temps, low temps, rainfall, etc. I need to open the file and return data based on a range of years from user input. The user enters a starting and an ending year, and I put the matching data into a list that the user can then search for the highest temperature (HIGHTEMP), the lowest temperature (LOWTEMP), or the highest rainfall (PRCP) within that range of years. Currently I can search for strings, but I'm not sure how to identify the high temps, for example, then gather them from the sub-list, find the highest, and return that data. The same goes for low temps and rainfall.

Here is what I have so far:

def openFile():
    begin = input("Enter your starting year in this format YYYY ")
    end = input("Enter your ending year for weather data in this format YYYY ")

    lines = tuple(open('/Users/jasontt/test/spokaneweatherdata.txt', 'r'))
    #print(lines)
    print("")
    #print(lines[1])
    print("")

    result = [i for i in lines if str(begin) in i]
    #print("This is begining data ", result)

    resultTwo = [i for i in lines if str(end) in i]
    #print("This is end of data ", resultTwo)
    #Combined list based on years entered
    ultimateList = [result + resultTwo]
    #Combined list of weather data for years selected
    print(ultimateList)

Test Data:

STATION           STATION_NAME                                       ELEVATION  LATITUDE   LONGITUDE  DATE     PRCP     TEMPMAX     TEMPMIN
----------------- -------------------------------------------------- ---------- ---------- ---------- -------- -------- -------- --------
GHCND:USW00013741                     SPOKANE REGIONAL AIRPORT WA US      366.1   37.31667  -79.96667 19490101 0.00     44       27
GHCND:USW00013741                     SPOKANE REGIONAL AIRPORT WA US      366.1   37.31667  -79.96667 19490102 0.00     42       25
GHCND:USW00013741                     SPOKANE REGIONAL AIRPORT WA US      366.1   37.31667  -79.96667 19490103 0.15     46       30
GHCND:USW00013741                     SPOKANE REGIONAL AIRPORT WA US      366.1   37.31667  -79.96667 19490104 0.03     41       30
GHCND:USW00013741                     SPOKANE REGIONAL AIRPORT WA US      366.1   37.31667  -79.96667 19490105 1.14     46       37
GHCND:USW00013741                     SPOKANE REGIONAL AIRPORT WA US      366.1   37.31667  -79.96667 19490106 0.00     51       40
GHCND:USW00013741                     SPOKANE REGIONAL AIRPORT WA US      366.1   37.31667  -79.96667 19490107 0.00     57       36
GHCND:USW00013741                     SPOKANE REGIONAL AIRPORT WA US      366.1   37.31667  -79.96667 19490108 0.00     56       45
GHCND:USW00013741                     SPOKANE REGIONAL AIRPORT WA US      366.1   37.31667  -79.96667 19490109 0.00     66       42
GHCND:USW00013741                     SPOKANE REGIONAL AIRPORT WA US      366.1   37.31667  -79.96667 19490110 0.00     70       51
GHCND:USW00013741                     SPOKANE REGIONAL AIRPORT WA US      366.1   37.31667  -79.96667 19490111 0.03     59       45
GHCND:USW00013741                     SPOKANE REGIONAL AIRPORT WA US      366.1   37.31667  -79.96667 19490112 0.04     48       38
GHCND:USW00013741                     SPOKANE REGIONAL AIRPORT WA US      366.1   37.31667  -79.96667 19490113 0.00     52       36
GHCND:USW00013741                     SPOKANE REGIONAL AIRPORT WA US      366.1   37.31667  -79.96667 19490114 0.00     56       36
GHCND:USW00013741                     SPOKANE REGIONAL AIRPORT WA US      366.1   37.31667  -79.96667 19490115 0.00     49       31
GHCND:USW00013741                     SPOKANE REGIONAL AIRPORT WA US      366.1   37.31667  -79.96667 19490116 0.00     68       28
GHCND:USW00013741                     SPOKANE REGIONAL AIRPORT WA US      366.1   37.31667  -79.96667 19490117 0.00     63       50
GHCND:USW00013741                     SPOKANE REGIONAL AIRPORT WA US      366.1   37.31667  -79.96667 19490118 0.04     53       42
GHCND:USW00013741                     SPOKANE REGIONAL AIRPORT WA US      366.1   37.31667  -79.96667 19490119 0.01     63       38
GHCND:USW00013741                     SPOKANE REGIONAL AIRPORT WA US      366.1   37.31667  -79.96667 19490120 0.00     45       28
GHCND:USW00013741                     SPOKANE REGIONAL AIRPORT WA US      366.1   37.31667  -79.96667 19490121 0.97     35       28
GHCND:USW00013741                     SPOKANE REGIONAL AIRPORT WA US      366.1   37.31667  -79.96667 19490122 0.29     60       34
GHCND:USW00013741                     SPOKANE REGIONAL AIRPORT WA US      366.1   37.31667  -79.96667 19490123 0.14     47       38
GHCND:USW00013741                     SPOKANE REGIONAL AIRPORT WA US      366.1   37.31667  -79.96667 19490124 0.01     72       38
GHCND:USW00013741                     SPOKANE REGIONAL AIRPORT WA US      366.1   37.31667  -79.96667 19490125 0.05     66       49

It's difficult to tell from a copy-pasted data sample, but it looks like your file uses a "fixed-width" line format: each column in a line starts at a given position and ends at a given position. This used to be quite a common kind of format back in the day...

So what you want here is to write down each column's name, start position and end position, so you can easily parse the lines into fields, i.e.:

FORMAT_MAP = {
    # fieldname : (start, end)
    "STATION": (0, 17),
    "STATION_NAME": (18, 68),
    "ELEVATION": (69, 79),
    # etc...
    }


def parse_line(line):
    return {name: line[start:end].strip() for name, (start, end) in FORMAT_MAP.items()}
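
Counting offsets by hand is error-prone. As a rough sketch, and assuming the second line of the file is a ruler of dashes with exactly one run per column and that no column name contains a space, the map could be derived from the file's own header. The names `spans_from_ruler` and `parse_fixed_width` are hypothetical helpers, not part of the code above, and the three sample lines only reproduce the last four columns of the test data:

```python
import re


def spans_from_ruler(header_line, ruler_line):
    """Pair each column name with the (start, end) span of its run of dashes."""
    # Assumption: names contain no spaces, so split() yields one name per column.
    names = header_line.split()
    spans = [match.span() for match in re.finditer(r"-+", ruler_line)]
    if len(names) != len(spans):
        raise ValueError("header and ruler disagree on the number of columns")
    return dict(zip(names, spans))


def parse_fixed_width(line, format_map):
    """Slice one data line into a dict of stripped string fields."""
    return {name: line[start:end].strip()
            for name, (start, end) in format_map.items()}


# Reduced sample: four 8-character columns separated by single spaces.
header = "DATE     PRCP     TEMPMAX  TEMPMIN"
ruler = " ".join(["-" * 8] * 4)
data_line = "19490101 0.00     44       27"

format_map = spans_from_ruler(header, ruler)
row = parse_fixed_width(data_line, format_map)
print(format_map)
print(row)
```

Taking the names from `split()` rather than slicing the header means a slightly misaligned header line does not matter; only the ruler has to line up with the data.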

Now you can parse your file into a sequence of field dicts:

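def iter_parse_file(f, startyear, endyear):
    # skip the two header lines (column names and dashed ruler)
    next(f)
    next(f)

    for line in f:
        # we assume the lines are sorted on date, that the date
        # format is YYYYMMDD, and that FORMAT_MAP has a "DATE" entry.
        row = parse_line(line)
        # years are compared as 4-character strings, so startyear
        # and endyear must be strings in YYYY format too
        year = row["DATE"][:4]
        if year < startyear:
            continue
        elif year > endyear:
            break
        yield row


# both stay strings, as returned by input()
startyear = input("Enter your starting year in this format YYYY ")
endyear = input("Enter your ending year in this format YYYY ")

with open("your/file.ext") as f:
    rows = list(iter_parse_file(f, startyear, endyear))

for row in rows:
    print("{DATE} : {TEMPMIN} - {TEMPMAX}".format(**row))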

You can also filter or sort on column values, build a pandas DataFrame, etc.
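
As an illustration of filtering and sorting, here is a minimal sketch that assumes `rows` is a list of dicts with the `DATE`, `PRCP`, `TEMPMAX` and `TEMPMIN` keys, all still strings. The four rows are copied from the test data; the variable names `rainy_days` and `by_high` are made up for the example:

```python
# Rows as the parser would yield them: every value is still a string.
rows = [
    {"DATE": "19490103", "PRCP": "0.15", "TEMPMAX": "46", "TEMPMIN": "30"},
    {"DATE": "19490105", "PRCP": "1.14", "TEMPMAX": "46", "TEMPMIN": "37"},
    {"DATE": "19490106", "PRCP": "0.00", "TEMPMAX": "51", "TEMPMIN": "40"},
    {"DATE": "19490110", "PRCP": "0.00", "TEMPMAX": "70", "TEMPMIN": "51"},
]

# Filter: keep only the days with some recorded rainfall.
rainy_days = [row for row in rows if float(row["PRCP"]) > 0]

# Sort: warmest daily high first. Converting with int() in the key matters,
# because as text "9" would sort above "70".
by_high = sorted(rows, key=lambda row: int(row["TEMPMAX"]), reverse=True)

for row in by_high:
    print(row["DATE"], row["TEMPMAX"])
```

`sorted` is stable, so days with the same high keep their original (date) order.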

Note that you can (and probably want to) convert your data to the proper types during parsing. With the above starting point you should be able to do so quite easily.
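
One possible way to do the conversion and then answer the original question (highest high, lowest low, most rainfall) is sketched below. `CONVERTERS`, `convert_row` and `summarize` are hypothetical names introduced here, and the sketch assumes every field holds a valid number or date; a real file may contain blanks or missing-value markers that would need extra handling:

```python
from datetime import datetime

# Hypothetical converter table: field name -> callable that turns text into a value.
CONVERTERS = {
    "DATE": lambda text: datetime.strptime(text, "%Y%m%d").date(),
    "PRCP": float,
    "TEMPMAX": int,
    "TEMPMIN": int,
}


def convert_row(row):
    """Return a copy of row with known fields converted; others stay strings."""
    return {name: CONVERTERS.get(name, str)(value) for name, value in row.items()}


def summarize(rows):
    """Return the rows holding the highest high, the lowest low and the most rain."""
    typed = [convert_row(row) for row in rows]
    if not typed:
        raise ValueError("no rows in the selected range")
    return {
        "highest_temp": max(typed, key=lambda r: r["TEMPMAX"]),
        "lowest_temp": min(typed, key=lambda r: r["TEMPMIN"]),
        "most_rain": max(typed, key=lambda r: r["PRCP"]),
    }


# Three rows copied from the test data, as strings straight from the parser.
rows = [
    {"DATE": "19490102", "PRCP": "0.00", "TEMPMAX": "42", "TEMPMIN": "25"},
    {"DATE": "19490105", "PRCP": "1.14", "TEMPMAX": "46", "TEMPMIN": "37"},
    {"DATE": "19490124", "PRCP": "0.01", "TEMPMAX": "72", "TEMPMIN": "38"},
]

summary = summarize(rows)
for label, row in summary.items():
    print(label, row["DATE"], row["TEMPMAX"], row["TEMPMIN"], row["PRCP"])
```

Returning the whole row rather than just the number keeps the date alongside each extreme, which is what the question asks to report.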
