
Is there a simple way to calculate the sum of the duration for each hour in the dataframe?

I have a dataframe similar to this one:

user  start_date           end_date             activity
1     2022-05-21 07:23:58  2022-05-21 14:23:48  sleep
2     2022-05-21 12:59:16  2022-05-21 14:59:16  eat
1     2022-05-21 18:59:16  2022-05-21 21:20:16  work
3     2022-05-21 18:50:00  2022-05-21 21:20:16  work

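I would like a dataframe whose columns are the activities and whose rows contain, for each hour, the sum of the durations of each activity across all users. Sorry, I find it hard even to put into words what I have in mind. The expected result should look something like this: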

hour  sleep[s]  eat[s]  work[s]
00    0         0       0
01    0         0       0
...   ...       ...     ...
07    2162      0       0
08    3600      0       0
...   ...       ...     ...
18    0         0       644
...   ...       ...     ...

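The dataframe has more than 10 million rows, so I am looking for something fast. I tried something with crosstab to get the expected columns and resample to get the rows, but I am not even close to a solution. I would be grateful for any ideas on how to do this. This is my first question, so I apologize in advance for any mistakes :)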

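I am assuming that your activity start times may span more than one day and that you actually want a summary for each day. If not, the answer can be adjusted by using datetime.hour instead of binning the start time.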
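To make the difference between the two options concrete, here is a minimal sketch. The `hourly` frame, its values, and the `hour` helper column are made up for illustration (they are not the asker's data); the frame stands for durations that have already been split into one-hour buckets.

```python
import pandas as pd

# Hypothetical hourly buckets spanning two days (illustrative values only)
hourly = pd.DataFrame({
    "start": pd.to_datetime([
        "2022-05-21 07:00:00",
        "2022-05-22 07:00:00",
        "2022-05-22 08:00:00",
    ]),
    "activity": ["Sleep", "Sleep", "Sleep"],
    "duration": [2162.0, 3600.0, 1800.0],
})

# Option 1: per-day summary, the full timestamp is the bucket
per_day = hourly.pivot_table(
    index="start", columns="activity", values="duration", aggfunc="sum"
)

# Option 2: hour-of-day summary, all days collapse onto the hours 0..23
hourly["hour"] = hourly["start"].dt.hour
per_hour = hourly.pivot_table(
    index="hour", columns="activity", values="duration", aggfunc="sum"
)

print(per_day)
print(per_hour)
```

With the first option the two 07:00 buckets stay on separate rows; with the second they are added together under hour 7.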

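The answer:

  1. Calculate the duration up to the top of each hour interval
  2. Expand the rows into 1-hour intervals
  3. Pivot the dataframe by activity

Code: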

import pandas as pd
import numpy as np
import math
from datetime import datetime, timedelta


data={'users':[1,2,1,3],
      'start':['2022-05-21 07:23:58', '2022-05-21 12:59:16', '2022-05-21 18:59:16', '2022-05-21 18:50:00'],
     'end':[ '2022-05-21 14:23:48', '2022-05-21 14:59:16', '2022-05-21 21:20:16', '2022-05-21 21:20:16'],
'activity':[ 'Sleep', 'Eat', 'Work', 'Work']
}

df=pd.DataFrame(data)

# convert to datetime
df.start=pd.to_datetime(df.start)
df.end=pd.to_datetime(df.end)

#Here is where the answer starts

# function to expand the duration in a list of one hour intervals
def genhourlist(e):
    start = e.start
    end = e.end
    nexthour = datetime(start.year,start.month,start.day,start.hour) + timedelta(hours=1)
    lst=[]
    
    while end > nexthour:
        inter = (nexthour - start)/pd.Timedelta(seconds=1)
        lst.append((datetime(start.year,start.month,start.day,start.hour),inter))
        
        start=nexthour
        nexthour=nexthour+timedelta(hours=1)
    
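    inter = (end - start)/pd.Timedelta(seconds=1)
    # bucket by the (possibly advanced) start hour, so an end exactly on the hour is not credited to the next hour
    lst.append((datetime(start.year,start.month,start.day,start.hour),inter))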
               
    return lst
    
# expand the duration
df['duration']=df.apply(genhourlist, axis=1)
df=df.explode('duration')

# update the duration and start
df['start']=df.duration.apply(lambda x: x[0])
df['duration']=df.duration.apply(lambda x: x[1])

pd.pivot_table(df,index=['start'],columns=['activity'], values=['duration'],aggfunc='sum')

Result:

                                duration                
activity                 Eat   Sleep    Work
start                                       
2022-05-21 07:00:00      NaN  2162.0     NaN
2022-05-21 08:00:00      NaN  3600.0     NaN
2022-05-21 09:00:00      NaN  3600.0     NaN
2022-05-21 10:00:00      NaN  3600.0     NaN
2022-05-21 11:00:00      NaN  3600.0     NaN
2022-05-21 12:00:00     44.0  3600.0     NaN
2022-05-21 13:00:00   3600.0  3600.0     NaN
2022-05-21 14:00:00   3556.0  1428.0     NaN
2022-05-21 18:00:00      NaN     NaN   644.0
2022-05-21 19:00:00      NaN     NaN  7200.0
2022-05-21 20:00:00      NaN     NaN  7200.0
2022-05-21 21:00:00      NaN     NaN  2432.0
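The question's expected output has one row for every hour from 00 to 23 and zeros instead of NaN. A possible way to get there from a pivoted result is sketched below. The `result` frame is a small stand-in holding three rows copied from the table above, with flat column names (which is what `pivot_table` returns when `values='duration'` is passed as a string rather than a list); the single-day date range is an assumption.

```python
import numpy as np
import pandas as pd

# Stand-in for the pivoted result: three rows copied from the table above
result = pd.DataFrame(
    {
        "Eat": [np.nan, 44.0, np.nan],
        "Sleep": [2162.0, 3600.0, np.nan],
        "Work": [np.nan, np.nan, 644.0],
    },
    index=pd.to_datetime([
        "2022-05-21 07:00:00",
        "2022-05-21 12:00:00",
        "2022-05-21 18:00:00",
    ]),
)

# All 24 hourly buckets of the day (assumes the data covers a single day)
all_hours = pd.date_range("2022-05-21", periods=24, freq=pd.Timedelta(hours=1))

# Add the missing hours, turn NaN into 0, and label rows by hour only
full = result.reindex(all_hours).fillna(0).astype(int)
full.index = full.index.strftime("%H")

print(full)
```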

Here is what I did with the 4 rows of data you posted.

Here is what I tried to do, along with my assumptions:

Goal = for each row, 
- calculate total elapsed time, 
- determine start hour sh, end hour eh, 
- work out how many seconds pertain to sh & eh

then:
-  create a 24-row df with results added up for all users (assuming the data starts at 00:00 and ends at 23:59 on the same day, there isn't any other day, and you're interested in the aggregate result for all users)

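I am using a for loop to build the final dataframe. It is probably not efficient enough if your original dataset has 10m rows, but it gets the job done on a smaller subset. Hopefully you can improve on it!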

import pandas as pd
import numpy as np


# work out total elapsed time, in seconds
df['total'] = df['end_date'] - df['start_date']
df['total'] = df['total'].dt.total_seconds()

# determine start/end hours (just the value: 07:00 AM => 7)
df['sh'] = df['start_date'].dt.hour
df['eh'] = df['end_date'].dt.hour

#determine start/end hours again
df['next_hour'] = df['start_date'].dt.ceil('H') # 07:23:58 => 08:00:00
df['previous_hour'] = df['end_date'].dt.floor('H') # 14:23:48 => 14:00:00
df

# determine how many seconds pertain to start/end hours
df['sh_s'] = df['next_hour'] - df['start_date']
df['sh_s'] = df['sh_s'].dt.total_seconds()

df['eh_s'] = df['end_date'] - df['previous_hour']
df['eh_s'] = df['eh_s'].dt.total_seconds()

# where start hour & end hour are the same (start=07:20, end=07:30)
my_mask = df['sh'].eq(df['eh']).values
# set sh_s to be 07:30 - 07:20 = 10 minutes = 600 seconds
df['sh_s'] = df['sh_s'].where(~my_mask, other=(df['end_date'] - df['start_date']).dt.total_seconds())
# set eh_s to be 0
df['eh_s'] = df['eh_s'].where(~my_mask, other=0)
df


# add all column hours to df
hour_columns = np.arange(24)
df[hour_columns] = 0.0  # numeric default, so untouched hours sum to 0 rather than concatenating empty strings
df

# Sorry for this horrible loop below... Hopefully you can improve upon it or get rid of it completely!
for i in range(len(df)): # for each row:
    
    sh = df.iloc[i, df.columns.get_loc('sh')] # get start hour value
    sh_s = df.iloc[i, df.columns.get_loc('sh_s')] #get seconds pertaining to start hour
    
    eh = df.iloc[i, df.columns.get_loc('eh')] # get end hour value
    eh_s = df.iloc[i, df.columns.get_loc('eh_s')] #get seconds pertaining to end hour
    
    # fill in hour columns
    
    # if start hour = end hour:
    if sh == eh:
        df.iloc[i, df.columns.get_loc(sh)] = sh_s
        # but ignore eh_s (it would cancel out the previous line)
    # if start hour != end hour, report both to the hour columns:
    else:
        df.iloc[i, df.columns.get_loc(sh)] = sh_s
        df.iloc[i, df.columns.get_loc(eh)] = eh_s
    
    # for each col between sh & eh, input 3600
    for j in range(sh + 1, eh):
        df.iloc[i, df.columns.get_loc(j)] = 3600

df
df.groupby('activity', as_index=False)[hour_columns].sum().transpose()
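One possible way to get rid of the row loop, sketched under the same assumption as above (every interval starts and ends on the same day): for each of the 24 hours, compute how many seconds of each interval overlap that hour with vectorized NumPy operations, then sum per activity. This is not the code from this answer; the names `s`, `e`, `cols`, and `out` are made up for the sketch.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "user": [1, 2, 1, 3],
    "start_date": pd.to_datetime([
        "2022-05-21 07:23:58", "2022-05-21 12:59:16",
        "2022-05-21 18:59:16", "2022-05-21 18:50:00",
    ]),
    "end_date": pd.to_datetime([
        "2022-05-21 14:23:48", "2022-05-21 14:59:16",
        "2022-05-21 21:20:16", "2022-05-21 21:20:16",
    ]),
    "activity": ["Sleep", "Eat", "Work", "Work"],
})

# Seconds since midnight for every start and end (same-day assumption)
midnight = df["start_date"].dt.normalize()
s = (df["start_date"] - midnight).dt.total_seconds().to_numpy()
e = (df["end_date"] - midnight).dt.total_seconds().to_numpy()
activity = df["activity"].to_numpy()

cols = {}
for h in range(24):  # 24 vectorized passes instead of one pass per row
    lo, hi = h * 3600, (h + 1) * 3600
    # length of the overlap between [s, e) and [lo, hi), never negative
    overlap = np.clip(np.minimum(e, hi) - np.maximum(s, lo), 0, None)
    cols[h] = pd.Series(overlap).groupby(activity).sum()

# Rows are hours 0..23, columns are activities
out = pd.DataFrame(cols).T
print(out)
```

Only one array of row length is alive at a time, so memory use stays proportional to the number of rows rather than rows times 24.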


Use:

#get difference in seconds between end and start
tot = df['end_date'].sub(df['start_date']).dt.total_seconds().astype(int)  # repeat() needs integer counts

#repeat unique index values, if necessary create default
#df = df.reset_index(drop=True)
df = df.loc[df.index.repeat(tot)].copy()
#add timedeltas in second to start datetimes
df['start'] = df['start_date'] + pd.to_timedelta(df.groupby(level=0).cumcount(), unit='s')
#create DatetimeIndex in hours with count activity - total seconds
df = df.groupby([pd.Grouper(freq='H',key='start'), 'activity']).size().unstack()
print (df)
activity                Eat   Sleep    Work
start                                      
2022-05-21 07:00:00     NaN  2162.0     NaN
2022-05-21 08:00:00     NaN  3600.0     NaN
2022-05-21 09:00:00     NaN  3600.0     NaN
2022-05-21 10:00:00     NaN  3600.0     NaN
2022-05-21 11:00:00     NaN  3600.0     NaN
2022-05-21 12:00:00    44.0  3600.0     NaN
2022-05-21 13:00:00  3600.0  3600.0     NaN
2022-05-21 14:00:00  3556.0  1428.0     NaN
2022-05-21 18:00:00     NaN     NaN   644.0
2022-05-21 19:00:00     NaN     NaN  7200.0
2022-05-21 20:00:00     NaN     NaN  7200.0
2022-05-21 21:00:00     NaN     NaN  2432.0
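Because this approach repeats each row once per second of its duration, the intermediate frame has as many rows as the total number of seconds. A quick way to check that size before expanding is sketched below on the 4 sample rows; the variable name `expanded_rows` is made up for the sketch.

```python
import pandas as pd

df = pd.DataFrame({
    "start_date": pd.to_datetime([
        "2022-05-21 07:23:58", "2022-05-21 12:59:16",
        "2022-05-21 18:59:16", "2022-05-21 18:50:00",
    ]),
    "end_date": pd.to_datetime([
        "2022-05-21 14:23:48", "2022-05-21 14:59:16",
        "2022-05-21 21:20:16", "2022-05-21 21:20:16",
    ]),
    "activity": ["Sleep", "Eat", "Work", "Work"],
})

# One repeated row per second of duration
tot = df["end_date"].sub(df["start_date"]).dt.total_seconds()
expanded_rows = int(tot.sum())

print(expanded_rows)  # 49866 rows for the 4 sample rows
```

If the durations in the full dataset are of a similar order, the expanded frame may not fit in memory for 10 million rows, so it is worth running this check on a sample first.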
