Python read text file for data then extract sub-strings from list of strings
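I have a weather data file with high temperatures, low temperatures, rainfall, and so on. I need to open the file and return data based on a range of years entered by the user. The user enters a start date and an end date, and I put the matching data into a list; the user can then search that sub-list for the highest temperature (HIGHTEMP), the lowest temperature (LOWTEMP), or the highest precipitation (PRCP). Currently I can search for strings, but I'm not sure how to identify, say, the high temperatures, collect them from the sub-list, find the highest one, and return that data. The same goes for low temperature and rainfall.
Here is what I have so far: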
def openFile():
    begin = input("Enter your starting year in this format YYYY ")
    end = input("Enter your ending year for weather data in this format YYYY ")
    lines = tuple(open('/Users/jasontt/test/spokaneweatherdata.txt', 'r'))
    #print(lines)
    print("")
    #print(lines[1])
    print("")
    result = [i for i in lines if str(begin) in i]
    #print("This is begining data ", result)
    resultTwo = [i for i in lines if str(end) in i]
    #print("This is end of data ", resultTwo)
    #Combined list based on years entered
    ultimateList = [result + resultTwo]
    #Combined list of weather data for years selected
    print(ultimateList)
Test data:
STATION STATION_NAME ELEVATION LATITUDE LONGITUDE DATE PRCP TEMPMAX TEMPMIN
----------------- -------------------------------------------------- ---------- ---------- ---------- -------- -------- -------- --------
GHCND:USW00013741 SPOKANE REGIONAL AIRPORT WA US 366.1 37.31667 -79.96667 19490101 0.00 44 27
GHCND:USW00013741 SPOKANE REGIONAL AIRPORT WA US 366.1 37.31667 -79.96667 19490102 0.00 42 25
GHCND:USW00013741 SPOKANE REGIONAL AIRPORT WA US 366.1 37.31667 -79.96667 19490103 0.15 46 30
GHCND:USW00013741 SPOKANE REGIONAL AIRPORT WA US 366.1 37.31667 -79.96667 19490104 0.03 41 30
GHCND:USW00013741 SPOKANE REGIONAL AIRPORT WA US 366.1 37.31667 -79.96667 19490105 1.14 46 37
GHCND:USW00013741 SPOKANE REGIONAL AIRPORT WA US 366.1 37.31667 -79.96667 19490106 0.00 51 40
GHCND:USW00013741 SPOKANE REGIONAL AIRPORT WA US 366.1 37.31667 -79.96667 19490107 0.00 57 36
GHCND:USW00013741 SPOKANE REGIONAL AIRPORT WA US 366.1 37.31667 -79.96667 19490108 0.00 56 45
GHCND:USW00013741 SPOKANE REGIONAL AIRPORT WA US 366.1 37.31667 -79.96667 19490109 0.00 66 42
GHCND:USW00013741 SPOKANE REGIONAL AIRPORT WA US 366.1 37.31667 -79.96667 19490110 0.00 70 51
GHCND:USW00013741 SPOKANE REGIONAL AIRPORT WA US 366.1 37.31667 -79.96667 19490111 0.03 59 45
GHCND:USW00013741 SPOKANE REGIONAL AIRPORT WA US 366.1 37.31667 -79.96667 19490112 0.04 48 38
GHCND:USW00013741 SPOKANE REGIONAL AIRPORT WA US 366.1 37.31667 -79.96667 19490113 0.00 52 36
GHCND:USW00013741 SPOKANE REGIONAL AIRPORT WA US 366.1 37.31667 -79.96667 19490114 0.00 56 36
GHCND:USW00013741 SPOKANE REGIONAL AIRPORT WA US 366.1 37.31667 -79.96667 19490115 0.00 49 31
GHCND:USW00013741 SPOKANE REGIONAL AIRPORT WA US 366.1 37.31667 -79.96667 19490116 0.00 68 28
GHCND:USW00013741 SPOKANE REGIONAL AIRPORT WA US 366.1 37.31667 -79.96667 19490117 0.00 63 50
GHCND:USW00013741 SPOKANE REGIONAL AIRPORT WA US 366.1 37.31667 -79.96667 19490118 0.04 53 42
GHCND:USW00013741 SPOKANE REGIONAL AIRPORT WA US 366.1 37.31667 -79.96667 19490119 0.01 63 38
GHCND:USW00013741 SPOKANE REGIONAL AIRPORT WA US 366.1 37.31667 -79.96667 19490120 0.00 45 28
GHCND:USW00013741 SPOKANE REGIONAL AIRPORT WA US 366.1 37.31667 -79.96667 19490121 0.97 35 28
GHCND:USW00013741 SPOKANE REGIONAL AIRPORT WA US 366.1 37.31667 -79.96667 19490122 0.29 60 34
GHCND:USW00013741 SPOKANE REGIONAL AIRPORT WA US 366.1 37.31667 -79.96667 19490123 0.14 47 38
GHCND:USW00013741 SPOKANE REGIONAL AIRPORT WA US 366.1 37.31667 -79.96667 19490124 0.01 72 38
GHCND:USW00013741 SPOKANE REGIONAL AIRPORT WA US 366.1 37.31667 -79.96667 19490125 0.05 66 49
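It's hard to tell from the copy-pasted data sample, but it looks like your file uses a "fixed-width" line format: each column in a row starts at a given position and ends at a given position. This was a very common "format" back in the day...
So what you want here is to write down each column's name, start position, and end position, so you can easily parse a line into fields, i.e.: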
FORMAT_MAP = {
    # fieldname : (start, end)
    "STATION": (0, 17),
    "STATION_NAME": (18, 68),
    "ELEVATION": (69, 79),
    # etc...
}

def parse_line(line):
    return {name: line[start:end].strip() for name, (start, end) in FORMAT_MAP.items()}
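If the dashed ruler line under the header survives intact in the real file, the spans may not need to be counted by hand. The sketch below assumes each run of dashes marks exactly one column, which matches the first three entries of the map above. The helper name `spans_from_ruler` is made up for illustration, and the header here is rebuilt from the dash widths in the sample, so check it against the actual file before relying on it.

```python
import re

def spans_from_ruler(header_line, ruler_line):
    """Map each column name to its (start, end) slice, one per run of dashes."""
    spans = {}
    for match in re.finditer(r"-+", ruler_line):
        start, end = match.start(), match.end()
        # The column name sits somewhere inside the span; strip the padding.
        spans[header_line[start:end].strip()] = (start, end)
    return spans

# Column names and dash-run widths as they appear in the sample data.
names = ["STATION", "STATION_NAME", "ELEVATION", "LATITUDE", "LONGITUDE",
         "DATE", "PRCP", "TEMPMAX", "TEMPMIN"]
widths = [17, 50, 10, 10, 10, 8, 8, 8, 8]

# Rebuild an aligned header and ruler (an assumption about the real layout).
header = " ".join(name.ljust(width) for name, width in zip(names, widths))
ruler = " ".join("-" * width for width in widths)

FORMAT_MAP = spans_from_ruler(header, ruler)
print(FORMAT_MAP["DATE"])
```

In a real run the two lines would come from the file itself, for example by reading them with `next(f)` before looping over the data lines.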
Now you can parse the file into a sequence of field dicts:
def iter_parse_file(f, startyear, endyear):
    # startyear and endyear are compared as "YYYY" strings
    # skip the first two header lines
    next(f); next(f)
    for line in f:
        # we assume the lines are sorted on date, and that the
        # date format is YYYYMMDD.
        row = parse_line(line)
        year = row["DATE"][:4]
        if year < startyear:
            continue
        elif year > endyear:
            break
        yield row

with open("your/file.ext") as f:
    rows = list(iter_parse_file(f, startyear, endyear))

for row in rows:
    print("{DATE} : {TEMPMIN} - {TEMPMAX}".format(**row))
You can also filter or sort on column values, build a pandas DataFrame, and so on.
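As a rough illustration of what the question asks for, the highest or lowest value can be picked with `max()` / `min()` and a `key` function. The rows below are hand-copied from the sample data and stand in for the output of `iter_parse_file`; the values are still strings, so the key converts them on the fly.

```python
# Stand-in for the parsed rows (a few lines copied from the sample data).
rows = [
    {"DATE": "19490102", "PRCP": "0.00", "TEMPMAX": "42", "TEMPMIN": "25"},
    {"DATE": "19490105", "PRCP": "1.14", "TEMPMAX": "46", "TEMPMIN": "37"},
    {"DATE": "19490109", "PRCP": "0.00", "TEMPMAX": "66", "TEMPMIN": "42"},
    {"DATE": "19490124", "PRCP": "0.01", "TEMPMAX": "72", "TEMPMIN": "38"},
]

# Compare as numbers, not as strings, so that e.g. "9" does not beat "72".
hottest = max(rows, key=lambda row: int(row["TEMPMAX"]))
coldest = min(rows, key=lambda row: int(row["TEMPMIN"]))
wettest = max(rows, key=lambda row: float(row["PRCP"]))

print("Highest temp: {TEMPMAX} on {DATE}".format(**hottest))
print("Lowest temp: {TEMPMIN} on {DATE}".format(**coldest))
print("Most rain: {PRCP} on {DATE}".format(**wettest))
```

Because `max()` and `min()` return the whole row, the date comes along with the extreme value.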
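Note that you can (and probably want to) convert the data to the proper types while parsing. With the above as a starting point, you should be able to do that quite easily.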
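One possible way to do that conversion is a per-field table of converter functions. The names `CONVERTERS` and `convert_row` are invented for this sketch, and the type chosen for each column is a guess based on the sample data.

```python
from datetime import datetime

# Hypothetical converter table; fields not listed here stay as strings.
CONVERTERS = {
    "ELEVATION": float,
    "LATITUDE": float,
    "LONGITUDE": float,
    "DATE": lambda text: datetime.strptime(text, "%Y%m%d").date(),
    "PRCP": float,
    "TEMPMAX": int,
    "TEMPMIN": int,
}

def convert_row(row):
    """Return a copy of a parsed row with each field converted to its type."""
    return {name: CONVERTERS.get(name, str)(value) for name, value in row.items()}

raw = {"STATION": "GHCND:USW00013741", "DATE": "19490105",
       "PRCP": "1.14", "TEMPMAX": "46", "TEMPMIN": "37"}
typed = convert_row(raw)
print(typed["DATE"].year, typed["PRCP"], typed["TEMPMAX"])
```

If the rows are converted inside the parsing loop, the year check has to change too: slicing `row["DATE"][:4]` only works while the date is still a string, so either convert after the check or compare `row["DATE"].year` against integer years. Real data may also contain missing-value markers that these converters would need to handle.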