
Resampling a DataFrame with datetime start and datetime end columns by day and category

Question

Given a table (DataFrame) of events, where each event (row) has a start datetime, a stop datetime, and an event category, how can I transform it into a table with one row for every combination of day and category, holding the number of hours during which events of that category were active on that day?

Example

Maybe it's easier to see an example than explain the problem:

I want to transform this DataFrame

datetime_start       datetime_end         event_category
2021-01-01 10:30:00  2021-01-03 16:30:00  'A'
2021-01-01 09:00:00  2021-01-01 15:30:00  'B'
2021-01-01 22:00:00  2021-01-01 23:00:00  'B'

Into this DataFrame

date        event_category  sum_of_hours_with_event_active
2021-01-01  'A'             13.5
2021-01-01  'B'             7.5
2021-01-02  'A'             24
2021-01-02  'B'             0
2021-01-03  'A'             16.5
2021-01-03  'B'             0

Answer

If you are certain that no time periods overlap on the same day within the same event category (or you want overlapping periods to be counted twice), you can build a base DataFrame containing every date for every event category and merge your time spans onto it.

Then, for each date, clip the event's end to the end of that day and its start to the beginning of that day, and subtract. The difference is the time the event contributes to that day only; a negative difference means the event does not touch that day, so it is clipped to 0. Finally, sum within each day and event category.

import pandas as pd

# df is the example DataFrame built in the "Data" section below.
# Enumerate all categories for every day.
dfb = pd.merge(pd.DataFrame({'event_category': df['event_category'].unique()}),
               pd.DataFrame({'date': pd.date_range(df.datetime_start.dt.normalize().min(),
                                                   df.datetime_end.dt.normalize().max(), freq='D')}),
               how='cross')

# Merge timespans 
m = dfb.merge(df, on='event_category')

# Calculate time for that day
m['sum_hours'] = ((m['datetime_end'].clip(upper=m['date']+pd.offsets.DateOffset(days=1))
                   - m['datetime_start'].clip(lower=m['date']))
                   .clip(lower=pd.Timedelta(0)))

# Sum of hours for event by day
m = (m.groupby(['event_category', 'date'])['sum_hours']
      .sum().dt.total_seconds().div(3600)
      .reset_index())

print(m)
  event_category       date  sum_hours
0              A 2021-01-01       13.5
1              A 2021-01-02       24.0
2              A 2021-01-03       16.5
3              B 2021-01-01        7.5
4              B 2021-01-02        0.0
5              B 2021-01-03        0.0
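
To see why the clipping in the code above yields per-day hours, the same arithmetic can be done by hand for the single event in category A. This is only an illustrative sketch with scalar timestamps: the loop and the names day_end, span and hours are not part of the answer's code, and 2021-01-04 is an assumed extra day, added to show a negative span being clipped to 0.

```python
import pandas as pd

# Event 'A' from the example, checked against each calendar day.
start = pd.Timestamp('2021-01-01 10:30:00')
end = pd.Timestamp('2021-01-03 16:30:00')

# 2021-01-04 lies after the event; it is included only to show the clip to 0.
days = pd.date_range('2021-01-01', '2021-01-04', freq='D')

hours = []
for day in days:
    day_end = day + pd.Timedelta(days=1)
    # Restrict the event to [day, day_end); a negative span means no overlap.
    span = min(end, day_end) - max(start, day)
    hours.append(max(span, pd.Timedelta(0)).total_seconds() / 3600)

print(hours)  # [13.5, 24.0, 16.5, 0.0]
```

The first three values match the rows for category A in the output above; the last day gives a span of -7.5 hours before clipping.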

Data

import pandas as pd

start_times = pd.DatetimeIndex(['2021-01-01 10:30:00', '2021-01-01 09:00:00', '2021-01-01 22:00:00'])
end_times = pd.DatetimeIndex(['2021-01-03 16:30:00', '2021-01-01 15:30:00', '2021-01-01 23:00:00'])
categories = ['A', 'B', 'B']
df = pd.DataFrame({'datetime_start': start_times, 'datetime_end': end_times, 'event_category': categories})

Answer

First, group by event_category so that apply runs once per category. Inside apply, concatenating the two series (+1 at each start time, -1 at each end time) gives the changes in the number of active events. The inner groupby and sum combine changes that share a timestamp, which is needed when several events in the same category start or end at the same time. The cumulative sum (cumsum) then gives the number of active events after each change, that is, after each time at which one or more events started or ended. Next, upsample with asfreq to the desired frequency, forward-filling the count; the step should be no coarser than the time granularity of the data. Finally, resample again by category and day (implemented with groupby and Grouper objects) and sum.

Essentially, we count the number of periods occupied by the events in each category, multiply by the length of a period (half an hour in the example), and group by day. The DateOffset object is used to parametrize the period length.

step = pd.DateOffset(hours=0.5)  # Half-hour steps

result = (
    df.groupby('event_category')
      .apply(lambda x: pd.concat([pd.Series(1, x['datetime_start']),
                                  pd.Series(-1, x['datetime_end'])])
                         .groupby(level=0)  # merge changes that share a timestamp
                         .sum()
                         .cumsum()          # number of active events after each change
                         .asfreq(step, method='ffill'))
      .groupby([pd.Grouper(level=0), pd.Grouper(level=1, freq='D')])
      .sum() * step.hours
)
print(result)

This will work for overlapping events in the same category.

Results

event_category
A               2021-01-01    13.5
                2021-01-02    24.0
                2021-01-03    16.5
B               2021-01-01     7.5
dtype: float64
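
To make the intermediate steps of the chain above concrete, here is a sketch that runs them one at a time for category B only. It is a simplified illustration, not the answer's exact code: the frequency is given as the string '30min' instead of a DateOffset, the final grouping uses the normalized index instead of Grouper objects, and the names changes, active, grid and hours_per_day are made up for this example.

```python
import pandas as pd

# Category 'B' events from the example data.
starts = pd.DatetimeIndex(['2021-01-01 09:00:00', '2021-01-01 22:00:00'])
ends = pd.DatetimeIndex(['2021-01-01 15:30:00', '2021-01-01 23:00:00'])

# +1 at each start, -1 at each end; summing per timestamp merges
# changes that happen at the same time (and sorts the index).
changes = pd.concat([pd.Series(1, index=starts), pd.Series(-1, index=ends)])
changes = changes.groupby(level=0).sum()

# Running total = number of active events from each change point onwards.
active = changes.cumsum()

# Upsample to a regular half-hour grid, carrying the last count forward.
grid = active.asfreq('30min', method='ffill')

# Each grid point stands for one half-hour period: count * 0.5 = hours.
hours_per_day = grid.groupby(grid.index.normalize()).sum() * 0.5
print(hours_per_day)
```

For this data, active takes the values 1, 0, 1, 0 at 09:00, 15:30, 22:00 and 23:00. The grid has 15 half-hour points with a count of 1, which gives the 7.5 hours shown for B on 2021-01-01.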


 