I am trying to do the following with pandas (Python).
I have a DataFrame with the following columns:
Building, Door_Color, Door_Time_Open, Door_Time_Close, Opening_Width
I want to group the data by date and time so that, for each second, I get the number of doors open and the sum of Opening_Width.
For example:
Data:
Building, Door_Color, Door_Time_Open, Door_Time_Close, Opening_Width
A , Red , 2000-01-01 00:00:00, 2000-01-01 00:00:05, 10
A , Red , 2000-01-01 00:00:02, 2000-01-01 00:00:04, 5
Result:
Date, Building, Door_Color, Door_Count, Sum_Opening_Width
2000-01-01 00:00:00, A, Red, 1 , 10
2000-01-01 00:00:01, A, Red, 1 , 10
2000-01-01 00:00:02, A, Red, 2 , 15
2000-01-01 00:00:03, A, Red, 2 , 15
2000-01-01 00:00:04, A, Red, 2 , 15
2000-01-01 00:00:05, A, Red, 1 , 10
2000-01-01 00:00:06, A, Red, 0 , 0
I know how to do a regular group by on multiple columns and aggregate different columns separately, but I have no idea how to check whether the timestamp being grouped on falls between the two timestamps in each row.
Any help would be much appreciated!
Edit 1: the data is fairly large, about 6 million rows.
If the data is not too big (and does not span a long period of time, since every row gets paired with every second in the range), you can do a cross merge:
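import pandas as pd

# One row per second between the earliest opening and the latest closing
times = pd.DataFrame({'Date': pd.date_range(df['Door_Time_Open'].min(),
                                            df['Door_Time_Close'].max(), freq='s'),
                      'dummy': 1
                     })

# Pair every door row with every second, then keep the seconds the door was open
(df.assign(dummy=1)
   .merge(times, on='dummy')
   .query('Door_Time_Open<=Date<=Door_Time_Close')
   .groupby(['Date','Building','Door_Color'])
   ['Opening_Width'].agg(['count','sum'])
   .reset_index()
)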
Output:
Date Building Door_Color count sum
0 2000-01-01 00:00:00 A Red 1 10
1 2000-01-01 00:00:01 A Red 1 10
2 2000-01-01 00:00:02 A Red 2 15
3 2000-01-01 00:00:03 A Red 2 15
4 2000-01-01 00:00:04 A Red 2 15
5 2000-01-01 00:00:05 A Red 1 10
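With about 6 million rows the cross merge may not fit in memory, and it also leaves out seconds where no door is open (such as 00:00:06 in the expected result). One possible alternative, offered only as a sketch: treat every opening as a +1 event and every closing as a -1 event placed one second after the close (the door still counts as open during its closing second), take a running total per group, and forward-fill the seconds in between. The function name open_doors_per_second and the output column names are my own choices here, and the sketch assumes both time columns are already datetimes.

```python
import pandas as pd

def open_doors_per_second(df):
    """Hypothetical helper: per-second open-door count and width sum via +/- events."""
    keys = ["Building", "Door_Color"]
    cols = ["Door_Count", "Sum_Opening_Width"]

    # +1 / +width at the second a door opens
    opens = df[keys].assign(Date=df["Door_Time_Open"], Door_Count=1,
                            Sum_Opening_Width=df["Opening_Width"])
    # -1 / -width one second after the door closes (closing second is still "open")
    closes = df[keys].assign(Date=df["Door_Time_Close"] + pd.Timedelta(seconds=1),
                             Door_Count=-1,
                             Sum_Opening_Width=-df["Opening_Width"])
    events = pd.concat([opens, closes], ignore_index=True)

    # Net change per group and second, then a running total within each group
    deltas = events.groupby(keys + ["Date"], as_index=False)[cols].sum()
    deltas[cols] = deltas.groupby(keys)[cols].cumsum()

    # Fill every second between a group's first and last event
    parts = []
    for key, g in deltas.groupby(keys):
        full = g.set_index("Date")[cols].resample("s").ffill().reset_index()
        parts.append(full.assign(**dict(zip(keys, key))))
    return pd.concat(parts, ignore_index=True)[["Date"] + keys + cols]

# The two sample rows from the question
sample = pd.DataFrame({
    "Building": ["A", "A"],
    "Door_Color": ["Red", "Red"],
    "Door_Time_Open": pd.to_datetime(["2000-01-01 00:00:00", "2000-01-01 00:00:02"]),
    "Door_Time_Close": pd.to_datetime(["2000-01-01 00:00:05", "2000-01-01 00:00:04"]),
    "Opening_Width": [10, 5],
})
result = open_doors_per_second(sample)
print(result)
```

The intermediate frame here has two rows per door rather than one row per door per second, so only the final per-second fill grows with the time span. The loop over groups is plain Python, so it may be slow if there are very many Building/Door_Color combinations.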
Alternatively, expand each row into one row per second during which the door is open, and then group:
import pandas as pd

def news(r):
    # Build one row per second during which the door in row r is open
    df1 = pd.DataFrame()
    df1['Date'] = pd.date_range(r['Door_Time_Open'], r['Door_Time_Close'], freq='s')
    for idx in ['Building', 'Door_Color', 'Opening_Width']:
        df1[idx] = r[idx]
    return df1

# Make sure both time columns are real datetimes
df['Door_Time_Open'] = pd.to_datetime(df['Door_Time_Open'])
df['Door_Time_Close'] = pd.to_datetime(df['Door_Time_Close'])

df_list = []
for idx, row in df.iterrows():
    df_list.append(news(row))

# Count the open doors and sum their widths for each second
data = pd.concat(df_list).groupby(['Date','Building','Door_Color'])['Opening_Width'].agg(['count','sum'])
print(data)
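The same per-row expansion can probably be written without iterrows by storing each row's list of seconds in a column and calling DataFrame.explode. This is only a sketch of that variation, not a tested replacement on the full 6 million rows: the names doors, expanded and per_second are made up for the example, and the exploded frame still has one row per door per open second, so memory use is similar.

```python
import pandas as pd

# The two sample rows from the question
doors = pd.DataFrame({
    "Building": ["A", "A"],
    "Door_Color": ["Red", "Red"],
    "Door_Time_Open": pd.to_datetime(["2000-01-01 00:00:00", "2000-01-01 00:00:02"]),
    "Door_Time_Close": pd.to_datetime(["2000-01-01 00:00:05", "2000-01-01 00:00:04"]),
    "Opening_Width": [10, 5],
})

# For each row, list every second from opening to closing (inclusive)
doors["Date"] = [
    list(pd.date_range(start, end, freq="s"))
    for start, end in zip(doors["Door_Time_Open"], doors["Door_Time_Close"])
]

# One row per (door, second); explode leaves an object column, so convert it back
expanded = doors.explode("Date")
expanded["Date"] = pd.to_datetime(expanded["Date"])

# Count the open doors and sum their widths for each second
per_second = (expanded
              .groupby(["Date", "Building", "Door_Color"])["Opening_Width"]
              .agg(["count", "sum"])
              .reset_index())
print(per_second)
```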