How to compare these data sets from a csv? Python 2.7
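I have a project where I'm trying to create a program that takes a CSV data set from www.transtats.gov, which is a data set of US airline flights. My goal is to find the flight from one airport to another that has the worst delays overall, meaning it is the "worst flight". So far I have this: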
`import csv
with open('826766072_T_ONTIME.csv') as csv_infile: #import and open CSV
    reader = csv.DictReader(csv_infile)
    total_delay = 0
    flight_count = 0
    flight_numbers = []
    delay_totals = []
    dest_list = [] #create empty list of destinations
    for row in reader:
        if row['ORIGIN'] == 'BOS': #only take flights leaving BOS
            if row['FL_NUM'] not in flight_numbers:
                flight_numbers.append(row['FL_NUM'])
            if row['DEST'] not in dest_list: #if the dest is not already in the list
                dest_list.append(row['DEST']) #append the dest to dest_list
    for number in flight_numbers:
        for row in reader:
            if row['ORIGIN'] == 'BOS': #for flights leaving BOS
                if row['FL_NUM'] == number:
                    if float(row['CANCELLED']) < 1: #if the flight is not cancelled
                        if float(row['DEP_DELAY']) >= 0: #and the delay is greater or equal to 0 (some flights had negative delay?)
                            total_delay += float(row['DEP_DELAY']) #add time of delay to total delay
                            flight_count += 1 #add the flight to total flight count
    for row in reader:
        for number in flight_numbers:
            delay_totals.append(sum(row['DEP_DELAY']))`
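I figured I could create a list of flight numbers and a list of the total delays for those flight numbers, then compare the two to see which flight has the highest total delay. What is the best way to compare the two lists?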
I'm not sure I understand correctly, but I think you should use a `dict` for this, where the key is `'FL_NUM'` and the value is the total delay.
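As a minimal sketch of that idea: the helper name `total_delay_by_flight` and the sample rows below are made up for illustration, and the rows are assumed to look like what `csv.DictReader` yields for this data set (every value a string). The totals are accumulated in one pass, and the flight number with the largest total is then picked out with `max`.

```python
from collections import defaultdict


def total_delay_by_flight(rows, origin='BOS'):
    """Return a dict mapping FL_NUM to the summed departure delay.

    Hypothetical helper: `rows` is any iterable of dicts with the
    ORIGIN, FL_NUM, CANCELLED and DEP_DELAY columns as strings.
    """
    totals = defaultdict(float)
    for row in rows:
        if row['ORIGIN'] != origin:
            continue
        if float(row['CANCELLED']) >= 1:  # skip cancelled flights
            continue
        delay = float(row['DEP_DELAY'])
        if delay >= 0:  # ignore early departures, as in the question
            totals[row['FL_NUM']] += delay
    return dict(totals)


# Made-up sample rows standing in for the real CSV
sample_rows = [
    {'ORIGIN': 'BOS', 'FL_NUM': '100', 'CANCELLED': '0.00', 'DEP_DELAY': '15.00'},
    {'ORIGIN': 'BOS', 'FL_NUM': '100', 'CANCELLED': '0.00', 'DEP_DELAY': '30.00'},
    {'ORIGIN': 'BOS', 'FL_NUM': '200', 'CANCELLED': '0.00', 'DEP_DELAY': '-5.00'},
    {'ORIGIN': 'BOS', 'FL_NUM': '200', 'CANCELLED': '1.00', 'DEP_DELAY': ''},
    {'ORIGIN': 'JFK', 'FL_NUM': '300', 'CANCELLED': '0.00', 'DEP_DELAY': '90.00'},
]

totals = total_delay_by_flight(sample_rows)
worst = max(totals, key=totals.get)  # flight number with the largest total
```

With a single dict there is nothing left to compare across two parallel lists: each flight number carries its own total.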
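In general, I like to eliminate loops in my Python code. For files that aren't large, I usually read through the data file once and build up some `dict`s that I can analyze at the end. The following code is untested, since I don't have the original data, but it follows the general pattern I would use.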
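Since a flight is identified by its origin, destination, and flight number, I would capture them as a `tuple` and use that as the key in my dictionary.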
import csv
from collections import defaultdict

flight_delays = defaultdict(list) # look this up if you aren't familiar

with open('826766072_T_ONTIME.csv') as csv_infile:
    reader = csv.DictReader(csv_infile)
    for row in reader:
        if row['ORIGIN'] == 'BOS': #only take flights leaving BOS
            if float(row['CANCELLED']) < 1: #skip cancelled flights
                flight = (row['ORIGIN'], row['DEST'], row['FL_NUM'])
                flight_delays[flight].append(float(row['DEP_DELAY']))

# Finished reading through data, now I want to calculate average delays
worst_flight = ""
worst_delay = 0
for flight, delays in flight_delays.items():
    average_delay = sum(delays) / len(delays)
    if average_delay > worst_delay:
        worst_flight = flight[0] + " to " + flight[1] + " on FL#" + flight[2]
        worst_delay = average_delay
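The final loop could also be written with the built-in `max` and a `key` function. The sketch below assumes `flight_delays` has the shape used in this answer, a dict from `(origin, dest, flight number)` tuples to lists of delays. The sample values and the `average` helper are invented for illustration.

```python
# Invented sample data in the same shape as flight_delays:
# (origin, dest, flight number) -> list of departure delays in minutes
flight_delays = {
    ('BOS', 'JFK', '100'): [10.0, 20.0],
    ('BOS', 'ORD', '200'): [45.0, 15.0, 60.0],
    ('BOS', 'LAX', '300'): [5.0],
}


def average(values):
    """Arithmetic mean as a float (works on Python 2.7 and 3)."""
    return sum(values) / float(len(values))


# Pick the key whose list of delays has the highest average
worst_flight = max(flight_delays, key=lambda f: average(flight_delays[f]))
worst_delay = average(flight_delays[worst_flight])

print("%s to %s on FL#%s: %.1f min" % (worst_flight + (worst_delay,)))
```

Note that `max` raises `ValueError` on an empty dict, so this form needs a guard if no rows match the filter, whereas the explicit loop simply leaves `worst_flight` empty.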
A very simple solution would be to add two new variables:
max_delay = 0
delay_flight = 0

# Replace: if float(row['DEP_DELAY']) >= 0: with:
if float(row['DEP_DELAY']) > max_delay:
    max_delay = float(row['DEP_DELAY'])
    delay_flight = row['FL_NUM'] # save the flight number (or the row number) for reference
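Put together, this approach finds the single longest departure delay rather than the worst total or average. A rough sketch, with invented sample rows standing in for the CSV reader:

```python
# Invented rows standing in for csv.DictReader output
rows = [
    {'ORIGIN': 'BOS', 'FL_NUM': '100', 'CANCELLED': '0.00', 'DEP_DELAY': '15.00'},
    {'ORIGIN': 'BOS', 'FL_NUM': '200', 'CANCELLED': '0.00', 'DEP_DELAY': '75.00'},
    {'ORIGIN': 'BOS', 'FL_NUM': '300', 'CANCELLED': '1.00', 'DEP_DELAY': ''},
    {'ORIGIN': 'JFK', 'FL_NUM': '400', 'CANCELLED': '0.00', 'DEP_DELAY': '120.00'},
]

max_delay = 0
delay_flight = None  # flight number of the longest delay seen so far

for row in rows:
    # Only non-cancelled flights leaving BOS, as in the question
    if row['ORIGIN'] == 'BOS' and float(row['CANCELLED']) < 1:
        delay = float(row['DEP_DELAY'])
        if delay > max_delay:
            max_delay = delay
            delay_flight = row['FL_NUM']
```

Checking `CANCELLED` before converting `DEP_DELAY` matters here, because the cancelled row in this sample has an empty delay field that `float` cannot parse.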