Python pandas: calculate the average temperature sensor value for the 15 minutes before a survey was filled in (match timestamps + add new column)
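I am trying to add new columns to the end of an Excel file containing survey data (PIT_da.xlsx). These columns should hold the average sensor values (e.g. temperature) for the 15, 30 and 60 minutes before the survey was filled in. The sensor data, including timestamps, is in the Excel file "IEQ_da.xlsx".
I started like this: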
#import raw file
import pandas as pd
import numpy as np
dfSD = pd.read_excel('IEQ_da.xlsx')
dfPIT = pd.read_excel('PIT_da.xlsx')
#main aim: add after each survey result row in PIT_da.xlsx columns for the average values of the indoor environmental quality parameters in 15/30/60 minutes before submitting the survey
#Step 0: set both timestamp and submitdate to right datetime object
dfSD['timestamp'] = pd.to_datetime(dfSD['timestamp'], format='%d%b%Y:%H:%M:%S.%f')
dfPIT['submitdate'] = pd.to_datetime(dfPIT['submitdate'], format='%d%b%Y:%H:%M:%S.%f')
#Step 1: introduce arrays and set to numpy
array1 = dfSD[['timestamp']].to_numpy().ravel()
array2 = dfPIT[['submitdate']].to_numpy().ravel()
data_sensorID = dfSD[['devid']].to_numpy().ravel()
survey_sensorID = dfPIT[['PIT5']].to_numpy().ravel()
Each survey has a timestamp (=submitdate) and should be matched to the sensor data at that timestamp.
Convert the times to numbers so that the 15 min / 30 min / 60 min differences can be calculated:
#Step 2: set timestamps to number and define a match
from datetime import datetime
def timestamps(x):
    # convert each datetime to seconds since the Unix epoch
    Timestamps = np.empty(x.size)
    for i in range(x.size):
        date = x[i]
        dt64 = np.datetime64(date)
        # no trailing 'Z': NumPy deprecates timezone suffixes in datetime64 strings
        timestamp = (dt64 - np.datetime64('1970-01-01T00:00:00')) / np.timedelta64(1, 's')
        Timestamps[i] = timestamp
    return Timestamps
array1TS = timestamps(array1)
array2TS = timestamps(array2)
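As a side note, the element-by-element loop above can usually be avoided, because pandas can do the same arithmetic on a whole datetime column at once. The following is a minimal sketch under the assumption that the column is already a datetime64 Series; the sample values and the names `ts` and `epoch_seconds` are illustrative and not taken from the real Excel files.

```python
import pandas as pd

# Illustrative timestamps (assumed values, not the real sensor data)
ts = pd.Series(pd.to_datetime(['2020-04-14 00:18:00', '2020-04-14 00:18:05']))

# Subtract the Unix epoch and divide by one second to get float seconds,
# for the whole column in one vectorized step
epoch_seconds = (ts - pd.Timestamp('1970-01-01')) / pd.Timedelta(seconds=1)

print(epoch_seconds.tolist())
```

Datetime columns can also be compared and subtracted directly, so for a "within the last 15 minutes" test the conversion to numbers may not be needed at all.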
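Next, each survey submit time is matched against the sensor timestamps (already rounded to the nearest 5 minutes), with the condition that the sensor device ID (=devid) equals PIT5 (the survey question that asks for the ID of the nearby sensor).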
#Step 3: define match with conditions: must be same timestamp and must have same sensor ID, by means of a matrix
# np.zeros instead of np.empty, so pairs with different sensor IDs are 0 rather than uninitialized
Match = np.zeros([array1TS.size, array2TS.size])
for i in range(array1TS.size):
    for j in range(array2TS.size):
        if data_sensorID[i] == survey_sensorID[j]:
            if array1TS[i] == array2TS[j]:
                Match[i, j] = 1
            else:
                Match[i, j] = 0
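The same match matrix can also be built without the double loop by using NumPy broadcasting. This is only a sketch: the four arrays below are small hypothetical stand-ins for the sensor and survey arrays from the earlier steps, and the names are assumptions chosen for illustration.

```python
import numpy as np

# Hypothetical stand-ins for the sensor arrays (timestamps and device IDs)
sensor_ts = np.array(['2020-04-14T00:18:00', '2020-04-14T00:18:05',
                      '2020-04-14T00:17:55'], dtype='datetime64[s]')
sensor_id = np.array(['4', '2', '4'])

# Hypothetical stand-ins for the survey arrays (submit times and PIT5 answers)
survey_ts = np.array(['2020-04-14T00:18:00', '2020-04-14T00:18:05'],
                     dtype='datetime64[s]')
survey_id = np.array(['4', '2'])

# Compare every sensor row (axis 0) with every survey row (axis 1)
same_id = sensor_id[:, None] == survey_id[None, :]
same_ts = sensor_ts[:, None] == survey_ts[None, :]

# 1 where both the sensor ID and the timestamp agree, 0 everywhere else
match = (same_id & same_ts).astype(int)
print(match)
```

The result has one row per sensor reading and one column per survey, like the loop version, but every cell is explicitly 0 or 1.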
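Now, using this match, a new column should be added to "PIT_da.xlsx" containing the average of the "SENtemp" column (temperature values) in IEQ_da.xlsx over the 15 minutes before the matched timestamp.
Questions:
1. How do I go from "Match" to selecting all rows in the 15 minutes before the matched timestamp?
2. How do I calculate the average of those selected rows (ignoring empty cells) and put it in a new column in PIT_da.xlsx (this new column should be named "SENtemp_15", for the temperature in the 15 minutes before the survey was filled in)?
Some data rows for reference: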
IEQ_da.xlsx
import pandas as pd
df = pd.DataFrame({'timestamp' : ['14/04/2020 00:18:00', '14/04/2020 00:18:05', '14/04/2020 00:17:55', '14/04/2020 00:17:50' , '14/04/2020 00:17:40', '14/04/2020 00:17:40', '14/04/2020 00:17:20', '14/04/2020 00:17:20'], 'devid' : ['4', '2', '4', '2', '4' , '2' , '4' , '2'],
'SENtemp' : ['20,2', '18,8', '20,1', '19', '20,2', '18,8', '20,1', '18,9']})
df
PIT_da.xlsx
import pandas as pd
df = pd.DataFrame({'submitdate' : ['14/04/2020 00:18:00', '14/04/2020 00:18:05'], 'PIT5' : ['4', '2'],
})
df
I hope someone is willing to help me!
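Your 2 initial steps are rather useless. You can use apply directly on dfPIT to build the new columns. The hardest part is that SENtemp is a string column with a comma as the decimal separator, so it cannot be converted to float directly. Possible code: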
delta = [15, 30, 60]                        # delta in minutes
columns = [f'Average{i}' for i in delta]    # column names per delta values
# 'min' is used as the unit because the 'T' alias is deprecated in recent pandas versions
dfPIT[columns] = dfPIT.apply(axis=1, func=lambda x: pd.Series(
    [dfSD.loc[(dfSD['timestamp'] > x['submitdate'] - pd.Timedelta(i, 'min'))
              & (dfSD['timestamp'] <= x['submitdate']), 'SENtemp']
     .str.replace(',', '.').astype('float').mean() for i in delta],
    index=columns))
With your sample data, it gives:
           submitdate PIT5  Average15  Average30  Average60
0 2020-04-14 00:18:00    4  19.614286  19.614286  19.614286
1 2020-04-14 00:18:05    2  19.512500  19.512500  19.512500
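Note that the averages above are taken over the readings of all sensors inside the time window. The question also asks that only readings from the sensor named in PIT5 be used, so the following is a possible variant with that extra condition. It is a sketch rather than part of the original answer: the helper `window_mean` is a hypothetical name, the date format is assumed from the sample rows, and the column names `SENtemp_15`, `SENtemp_30` and `SENtemp_60` follow the naming asked for in the question.

```python
import pandas as pd

# Sample rows from the question
dfSD = pd.DataFrame({
    'timestamp': ['14/04/2020 00:18:00', '14/04/2020 00:18:05',
                  '14/04/2020 00:17:55', '14/04/2020 00:17:50',
                  '14/04/2020 00:17:40', '14/04/2020 00:17:40',
                  '14/04/2020 00:17:20', '14/04/2020 00:17:20'],
    'devid': ['4', '2', '4', '2', '4', '2', '4', '2'],
    'SENtemp': ['20,2', '18,8', '20,1', '19', '20,2', '18,8', '20,1', '18,9'],
})
dfPIT = pd.DataFrame({'submitdate': ['14/04/2020 00:18:00', '14/04/2020 00:18:05'],
                      'PIT5': ['4', '2']})

# Assumed format of the sample rows: day/month/year hour:minute:second
fmt = '%d/%m/%Y %H:%M:%S'
dfSD['timestamp'] = pd.to_datetime(dfSD['timestamp'], format=fmt)
dfPIT['submitdate'] = pd.to_datetime(dfPIT['submitdate'], format=fmt)

# Turn the decimal comma into a point once, then convert to float
dfSD['SENtemp'] = dfSD['SENtemp'].str.replace(',', '.', regex=False).astype(float)

def window_mean(row, minutes):
    """Mean SENtemp of the survey's own sensor in the window before submitdate."""
    start = row['submitdate'] - pd.Timedelta(minutes=minutes)
    mask = (
        (dfSD['devid'] == row['PIT5'])            # same sensor as named in the survey
        & (dfSD['timestamp'] > start)             # inside the window
        & (dfSD['timestamp'] <= row['submitdate'])
    )
    return dfSD.loc[mask, 'SENtemp'].mean()       # mean() skips NaN by default

for minutes in (15, 30, 60):
    dfPIT[f'SENtemp_{minutes}'] = dfPIT.apply(window_mean, axis=1, minutes=minutes)

print(dfPIT)
```

With the question's sample rows this yields 20.15 for the survey linked to sensor 4 and 18.875 for the one linked to sensor 2 in all three windows, because every sample reading falls within 15 minutes of the submit time.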