Python - Finding average of a column in a CSV given a value in another column (data from a specific year in a file with multiple years)?
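The CSV files used in this code are air quality sensor data files. They record particle concentrations hourly, in some cases over multiple years. I am working with about 100 CSV files. I have figured out how to go through each file and average a variable regardless of year, but I am having trouble finding the average for 2020 only.
The goal of the code is to find the average number of hours each sensor ran in 2020.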
# import modules
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import csv

# Read in table summarizing key variables about each Purple Air station around Pittsburgh
summary_table = pd.read_csv('Pittsburgh Hourly Averaged PM Data.csv')
# Subset the table to include only stations to be used in analysis
summary_table = summary_table[summary_table['Y/N'] == 'Y']
# Number of stations
print('Initial number of stations: ', len(summary_table))

num_hr = []
# Loop through all rows in the summary data table. For each row, find filename
# of the station corresponding to the row and read in that station data.
hours_utc = ['00','01','02','03','04','05','06','07','08','09','10','11','12','13','14','15','16','17','18','19','20','21','22','23']
for i in summary_table.index:
    station_data = pd.read_csv('Hourly_Averages/Excel_Data/' + summary_table.at[i,'Filename'] + '.csv')
    if station_data['year'] == 2020:
        # num_hr.append(station_data['PM2.5_CF1_ug/m3'].mean())
        station_data = station_data[station_data['hr'] == h]
print(num_hr)

with open('average_hr.csv', 'w') as f:
    writer = csv.writer(f)
    writer.writerow(num_hr)
A sample of the CSV the code uses (the full CSV is thousands of rows long, and I don't know how to put the full file in the question).
,Unnamed: 0,Unnamed: 0.1,Unnamed: 0.1.1,Unnamed: 0.1.1.1,created_at,PM1.0_CF1_ug/m3,PM2.5_CF1_ug/m3,PM10.0_CF1_ug/m3,UptimeMinutes,RSSI_dbm,Temperature_F,Humidity_%,PM2.5_ATM_ug/m3,hr,year,month,date,season
0,0,0,0,0,2020-12-23 17:00:00 UTC,0,0.04,0.12,7.5,-39.45,71,14.85,0.04,17,2020,12,12/23/20,Winter
1,1,1,1,1,2020-12-23 18:00:00 UTC,172.9,393.94,489.19,47.41,-36.93,76.34,14.72,261.9,18,2020,12,12/23/20,Winter
2,2,2,2,2,2020-12-23 19:00:00 UTC,77.59,144.78,161.67,101,-37.7,76.17,15.61,95.94,19,2020,12,12/23/20,Winter
3,3,3,3,3,2021-01-07 19:00:00 UTC,103.61,236.47,298.67,28.04,-60.39,76,14.61,157.63,19,2021,1,1/7/21,Winter
4,4,4,4,4,2021-01-07 20:00:00 UTC,11.18,21.12,23.04,64,-59.55,78.91,13.36,19.77,20,2021,1,1/7/21,Winter
5,5,5,5,5,2021-01-13 18:00:00 UTC,59.77,96.07,102.51,13.26,-49.52,73.78,29.48,65.32,18,2021,1,1/13/21,Winter
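FYI, I am fairly new to coding and to working with CSV files. My question may have a simple answer, but after looking at many websites I am still stuck. I would appreciate any help you can offer.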
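Imagine this is your table:
The idea I am trying to give you:
How to perform an operation on a column based on a condition in another column: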
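import pandas as pd

# Tell pandas to read only these columns
fields = ['Sensor_1', 'Sensor_2', 'Sensor_3', 'Year']
df = pd.read_excel('myData.xlsx', usecols=fields)
# Mean of Sensor_1 over all years (not used below)
sensor1 = df.Sensor_1.mean()
# Iterating over a DataFrame yields its column names
for x in df:
    if x != 'Year':
        # Keep the 2020 values (other rows become NaN), sum them,
        # and divide by 14, the total number of rows in the example table
        sensor = df[x].where(df['Year'] == 2020).sum() / 14
        print(sensor)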
The result is:
10.785714285714286 # Sensor_1: 2020 sum / 14
4.357142857142857 # Sensor_2: 2020 sum / 14
2.892857142857143 # Sensor_3: 2020 sum / 14
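More:
After reading the code you may be wondering whether there is a function that gives you the average directly. There is, and it is called mean(). Be aware that mean() ignores the rows masked out by the condition (where(df['Year'] == 2020)), so it does not return the same number as the code above: in my example it returns sum() / 10, because 4 of the 14 rows are from 2021. That is the average over the 2020 rows only, which is normally the figure you want for a single year; dividing by 14 spreads the 2020 total over every row in the table.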
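To see how mean() treats the rows masked by where(), here is a minimal sketch on a small made-up table. The five rows and their values are assumptions for illustration only, not the table from the answer.

```python
import pandas as pd

# Hypothetical data: 3 rows from 2020 and 2 rows from 2021.
df = pd.DataFrame({
    "Sensor_1": [10.0, 20.0, 30.0, 100.0, 200.0],
    "Year": [2020, 2020, 2020, 2021, 2021],
})

# where() keeps values where the condition holds and puts NaN elsewhere.
masked = df["Sensor_1"].where(df["Year"] == 2020)

# mean() skips NaN, so it divides the 2020 sum by the number of 2020 rows.
mean_2020 = masked.mean()

# Dividing the same sum by the total row count includes the 2021 rows
# in the denominator.
sum_over_all_rows = masked.sum() / len(df)

# Filtering the rows first and then calling mean() matches mean_2020.
filtered_mean = df.loc[df["Year"] == 2020, "Sensor_1"].mean()

print(mean_2020, sum_over_all_rows, filtered_mean)  # 20.0 12.0 20.0
```

If the goal is the average for one year, the first and third forms are the ones that divide by that year's row count.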
That is all you need: just replace the attribute names in my code with yours, and I think it will help you.
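Applied to the station files from the question, the same idea might look like the sketch below. The helper name summarize_2020 is hypothetical, and it assumes each file has the year and PM2.5_CF1_ug/m3 columns shown in the sample and one row per recorded hour. Note that a line such as if station_data['year'] == 2020: raises a ValueError in pandas, because a whole column has no single truth value; a boolean mask selects the matching rows instead.

```python
import pandas as pd

def summarize_2020(station_data, year=2020, column="PM2.5_CF1_ug/m3"):
    """Return (number of rows, mean of `column`) for the rows from `year`.

    Hypothetical helper; with hourly data the row count is the number
    of hours recorded in that year.
    """
    in_year = station_data[station_data["year"] == year]  # boolean mask
    return len(in_year), in_year[column].mean()

# Small stand-in for one station file, using values from the sample CSV.
sample = pd.DataFrame({
    "year": [2020, 2020, 2020, 2021, 2021, 2021],
    "hr": [17, 18, 19, 19, 20, 18],
    "PM2.5_CF1_ug/m3": [0.04, 393.94, 144.78, 236.47, 21.12, 96.07],
})

hours, mean_pm25 = summarize_2020(sample)
print(hours, mean_pm25)
```

Inside the loop over summary_table.index, calling summarize_2020(station_data) for each file and appending the result to num_hr would collect one entry per station.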