Given a table (DataFrame) of events, where each row has a start datetime, an end datetime and an event category, how can I transform it into a table with one row per combination of day and category, holding the number of hours that category was active on that day?
An example is probably clearer than the explanation.
I want to transform this DataFrame
datetime_start | datetime_end | event_category |
---|---|---|
2021-01-01 10:30:00 | 2021-01-03 16:30:00 | 'A' |
2021-01-01 09:00:00 | 2021-01-01 15:30:00 | 'B' |
2021-01-01 22:00:00 | 2021-01-01 23:00:00 | 'B' |
Into this DataFrame
date | event_category | sum_of_hours_with_event_active |
---|---|---|
2021-01-01 | 'A' | 13.5 |
2021-01-01 | 'B' | 7.5 |
2021-01-02 | 'A' | 24 |
2021-01-02 | 'B' | 0 |
2021-01-03 | 'A' | 16.5 |
2021-01-03 | 'B' | 0 |
If you are certain there are no overlapping time periods on the same day within the same event category (or you want to double count those time periods), you can build the basis of all dates by event categories and merge your timespans onto that DataFrame.
Then, by subtracting with clipping, we can calculate the time each event contributes to that day only (negative values mean the event does not touch that day, so they get clipped to 0). Finally, we can sum within day by event.
import pandas as pd
# Enumerate all categories for every day.
dfb = pd.merge(pd.DataFrame({'event_category': df['event_category'].unique()}),
               pd.DataFrame({'date': pd.date_range(df.datetime_start.dt.normalize().min(),
                                                   df.datetime_end.dt.normalize().max(), freq='D')}),
               how='cross')
# Merge timespans
m = dfb.merge(df, on='event_category')
# Calculate time for that day
m['sum_hours'] = ((m['datetime_end'].clip(upper=m['date']+pd.offsets.DateOffset(days=1))
                   - m['datetime_start'].clip(lower=m['date']))
                  .clip(lower=pd.Timedelta(0)))
# Sum of hours for event by day
m = (m.groupby(['event_category', 'date'])['sum_hours']
      .sum().dt.total_seconds().div(3600)
      .reset_index())
print(m)
  event_category       date  sum_hours
0              A 2021-01-01       13.5
1              A 2021-01-02       24.0
2              A 2021-01-03       16.5
3              B 2021-01-01        7.5
4              B 2021-01-02        0.0
5              B 2021-01-03        0.0
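The clipping step is the core of this approach, so it may help to trace it for a single event. The sketch below takes the category 'A' event from the question and applies the same clip-and-subtract logic with scalar timestamps. The helper hours_on_day is hypothetical and only for illustration; it is not part of the answer's code.

```python
import pandas as pd

def hours_on_day(start, end, day):
    """Hours of the event [start, end) that fall on the calendar day `day`.

    Hypothetical helper for illustration; mirrors the vectorised clip logic.
    """
    next_day = day + pd.Timedelta(days=1)
    # Clip the event to the window [day, next_day).
    overlap = min(end, next_day) - max(start, day)
    # A negative overlap means the event does not touch this day at all.
    overlap = max(overlap, pd.Timedelta(0))
    return overlap.total_seconds() / 3600

start = pd.Timestamp("2021-01-01 10:30:00")
end = pd.Timestamp("2021-01-03 16:30:00")

# One extra day (2021-01-04) is included to show the clip to zero.
per_day = {
    day.date().isoformat(): hours_on_day(start, end, day)
    for day in pd.date_range("2021-01-01", "2021-01-04", freq="D")
}
print(per_day)
# {'2021-01-01': 13.5, '2021-01-02': 24.0, '2021-01-03': 16.5, '2021-01-04': 0.0}
```

The day after the event ends yields a negative difference (-7.5 hours), which the clip turns into 0.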
import pandas as pd
start_times = pd.DatetimeIndex(['2021-01-01 10:30:00', '2021-01-01 09:00:00', '2021-01-01 22:00:00'])
end_times = pd.DatetimeIndex(['2021-01-03 16:30:00', '2021-01-01 15:30:00', '2021-01-01 23:00:00'])
categories = ['A', 'B', 'B']
df = pd.DataFrame({'datetime_start': start_times, 'datetime_end': end_times, 'event_category': categories})
First we groupby event_category so that the apply works per category. The concatenation of the two series represents the changes in the events, that is, the beginnings and ends of events. The groupby and sum inside the apply are needed in case multiple events start or end at the same time in the same category. The cumulative sum (cumsum) gives the total number of active events at each time a change occurred, that is, whenever one or more events started or ended. Next we upsample with asfreq to the desired frequency, which should be at least as fine as the time granularity of the data. Finally we resample again (implemented with groupby and Grouper objects) and sum.
Essentially we are counting the number of periods occupied by all the events in each category, multiplying by the length of a period (half an hour in the example), and then grouping by day. The DateOffset object is used to parametrize the period.
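To make the intermediate values visible, the counting steps can be traced for category 'B' alone. This is a minimal sketch rather than the answer's code: it works on one category without groupby/apply, and it assumes the string alias "30min" for the half-hour period.

```python
import pandas as pd

# Category 'B' events from the question.
starts = pd.DatetimeIndex(["2021-01-01 09:00:00", "2021-01-01 22:00:00"])
ends = pd.DatetimeIndex(["2021-01-01 15:30:00", "2021-01-01 23:00:00"])

# +1 at every start, -1 at every end; identical timestamps are merged.
changes = pd.concat([pd.Series(1, index=starts), pd.Series(-1, index=ends)])
changes = changes.groupby(level=0).sum()

# Running total: number of events active from each change point onwards.
active = changes.cumsum()

# Upsample to half-hour periods, carrying the last known count forward.
per_period = active.asfreq("30min", method="ffill")

# Each occupied period contributes half an hour.
hours = per_period.sum() * 0.5
print(active.tolist())  # [1, 0, 1, 0]
print(hours)            # 7.5
```

The count is 1 for the 13 periods from 09:00 to 15:00 and for the 2 periods at 22:00 and 22:30, giving 15 half-hours, or 7.5 hours.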
step = pd.DateOffset(hours=0.5)  # Half hour steps
df.groupby('event_category') \
  .apply(lambda x: pd.concat([pd.Series(1, x['datetime_start']),
                              pd.Series(-1, x['datetime_end'])]) \
                     .groupby(level=0) \
                     .sum() \
                     .cumsum() \
                     .asfreq(step, method='ffill')
        ) \
  .groupby([pd.Grouper(level=0), pd.Grouper(level=1, freq='D')]) \
  .sum() * step.hours
This will work for overlapping events in the same category.
event_category
A               2021-01-01    13.5
                2021-01-02    24.0
                2021-01-03    16.5
B               2021-01-01     7.5
dtype: float64
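Note that this result has no rows for category 'B' on 2021-01-02 and 2021-01-03, whereas the question asks for zeros there. One possible way to fill them in is to reindex against the full category-by-date product. The sketch below uses a stand-in Series rebuilt by hand from the printed values; the index level names and the final column name are assumptions taken from the question's table, not from the answer.

```python
import pandas as pd

# Stand-in for the result above, rebuilt by hand from its printed values.
# The level names are an assumption made for this sketch.
hours = pd.Series(
    [13.5, 24.0, 16.5, 7.5],
    index=pd.MultiIndex.from_arrays(
        [["A", "A", "A", "B"],
         pd.to_datetime(["2021-01-01", "2021-01-02", "2021-01-03", "2021-01-01"])],
        names=["event_category", "date"],
    ),
)

# Build every (category, day) pair between the first and last day seen.
categories = hours.index.get_level_values("event_category").unique()
dates = hours.index.get_level_values("date")
full_index = pd.MultiIndex.from_product(
    [categories, pd.date_range(dates.min(), dates.max(), freq="D")],
    names=["event_category", "date"],
)

# Missing pairs get 0 hours; then flatten to the table layout of the question.
filled = hours.reindex(full_index, fill_value=0.0)
table = filled.rename("sum_of_hours_with_event_active").reset_index()
print(table)
```

The resulting table has six rows, one per category and day, with 0.0 for the two days on which 'B' had no events.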