
Transform random time intervals into structured 30-minute intervals

I have this DataFrame, where each row records the time period during which a task took place:

                    Date       Start Time              End Time
0     2016-01-01 0:00:00   2016-01-01 0:10:00   2016-01-01 0:25:00
1     2016-01-01 0:00:00   2016-01-01 1:17:00   2016-01-01 1:31:00
2     2016-01-02 0:00:00   2016-01-02 0:30:00   2016-01-02 0:32:00
...                  ...                  ...                  ...

I want to convert this df to 30-minute intervals. Expected outcome:

                    Date       Hours              
1     2016-01-01 0:30:00        0:15
2     2016-01-01 1:00:00        0:00
3     2016-01-01 1:30:00        0:13
4     2016-01-01 2:00:00        0:01
5     2016-01-01 2:30:00        0:00
6     2016-01-01 3:00:00        0:00
...                  ...            
47     2016-01-01 23:30:00        0:00
48     2016-01-02 23:59:59        0:00
49     2016-01-02 00:30:00        0:00
50     2016-01-02 01:00:00        0:02
...                  ...               

I was trying to do this with a for loop, which was getting tedious. Is there a simple way to do it in pandas?

IIUC, you can discard the Date column, compute the difference between End Time and Start Time, group by 30-minute slots, and aggregate with first (this assumes there is at most one entry per 30-minute slot):

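import pandas as pd

# assumes "Start Time" and "End Time" are already datetime64 columns;
# "30min" replaces the alias "30T", which is deprecated in recent pandas
print(df.assign(Diff=df["End Time"] - df["Start Time"])
        .groupby(pd.Grouper(key="Start Time", freq="30min"))
        .agg({"Diff": "first"})
        .fillna(pd.Timedelta(seconds=0)))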

                               Diff
Start Time                         
2016-01-01 00:00:00 0 days 00:15:00
2016-01-01 00:30:00 0 days 00:00:00
2016-01-01 01:00:00 0 days 00:14:00
2016-01-01 01:30:00 0 days 00:00:00
2016-01-01 02:00:00 0 days 00:00:00
2016-01-01 02:30:00 0 days 00:00:00
...
2016-01-02 00:30:00 0 days 00:02:00
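A minimal sketch of a variation on the snippet above, for the case where a slot may hold more than one task: it sums the durations instead of taking the first one. The three sample rows are copied from the question, and the `Minutes` column name is only illustrative. Like the snippet above, it credits a task's whole duration to the slot in which the task starts, so the 1:17 to 1:31 task counts as 14 minutes in the 01:00 slot rather than being split across two slots.

```python
import pandas as pd

# Sample rows copied from the question (the Date column is not needed).
df = pd.DataFrame({
    "Start Time": pd.to_datetime(["2016-01-01 0:10:00",
                                  "2016-01-01 1:17:00",
                                  "2016-01-02 0:30:00"]),
    "End Time": pd.to_datetime(["2016-01-01 0:25:00",
                                "2016-01-01 1:31:00",
                                "2016-01-02 0:32:00"]),
})

# Duration of each task in minutes ("Minutes" is a hypothetical column name).
df["Minutes"] = (df["End Time"] - df["Start Time"]).dt.total_seconds() / 60

# Sum the durations per 30-minute slot, keyed by the slot the task starts in.
# Slots without any task come out as 0.0.
out = df.groupby(pd.Grouper(key="Start Time", freq="30min"))["Minutes"].sum()

print(out.head())
```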

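The idea is to create a Series of zeros with a per-minute DatetimeIndex running from the earliest Start Time to the latest End Time. Add 1 at each Start Time and subtract 1 at each End Time; the cumsum is then 1 for every minute inside a task and 0 elsewhere. Summing that with resample per 30 minutes gives the number of task minutes in each slot (label='right' names each slot by its end, as in the expected output), and reset_index turns the result into a DataFrame. The last line of code only puts the Hours column into the desired format.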

import pandas as pd

# create a Series of zeros with a per-minute DatetimeIndex
res = pd.Series(data=0,
                index=pd.date_range(df['Start Time'].min(),
                                    df['End Time'].max(),
                                    freq='min',  # 'T' is a deprecated alias in recent pandas
                                    name='Dates'),
                name='Hours')

# add 1 at each start time and subtract 1 at each end time
# (assumes the timestamps fall on whole minutes and no two tasks share a start or end time)
res[df['Start Time']] += 1
res[df['End Time']] -= 1

# cumsum marks each minute inside a task with 1; then sum per 30-minute slot
res = (res.cumsum()
          .resample('30min', label='right').sum()
          .reset_index('Dates')
      )

# change the format of the Hours column (optional; the values are minute counts)
res['Hours'] = pd.to_datetime(res['Hours'], format='%M').dt.strftime('%H:%M') # or .dt.time

print(res)
                 Dates  Hours
0  2016-01-01 00:30:00  00:15
1  2016-01-01 01:00:00  00:00
2  2016-01-01 01:30:00  00:13
3  2016-01-01 02:00:00  00:01
4  2016-01-01 02:30:00  00:00
5  2016-01-01 03:00:00  00:00
...
48 2016-01-02 00:30:00  00:00
49 2016-01-02 01:00:00  00:02
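For anyone who wants to run the per-minute counting approach end to end, here is a minimal self-contained sketch on the three sample rows from the question. It swaps the in-place `+= 1` / `-= 1` for `value_counts` plus `reindex`, so that several tasks starting or ending in the same minute are all counted; that swap, and the names `starts`, `ends`, `active`, `minutes` and `hours`, are illustrative choices, not part of the answer above. It still assumes every timestamp falls on a whole minute.

```python
import pandas as pd

# Sample rows copied from the question.
df = pd.DataFrame({
    "Start Time": pd.to_datetime(["2016-01-01 0:10:00",
                                  "2016-01-01 1:17:00",
                                  "2016-01-02 0:30:00"]),
    "End Time": pd.to_datetime(["2016-01-01 0:25:00",
                                "2016-01-01 1:31:00",
                                "2016-01-02 0:32:00"]),
})

# One entry per minute between the earliest start and the latest end.
idx = pd.date_range(df["Start Time"].min(), df["End Time"].max(), freq="min")

# How many tasks start / end in each minute (0 where nothing happens).
starts = df["Start Time"].value_counts().reindex(idx, fill_value=0)
ends = df["End Time"].value_counts().reindex(idx, fill_value=0)

# The running total is the number of tasks active during each minute.
active = (starts - ends).cumsum()

# Task minutes per 30-minute slot, each slot labelled by its end time.
minutes = (active.resample("30min", label="right").sum()
                 .rename("Minutes")
                 .rename_axis("Dates"))

# Optional HH:MM formatting, done with plain integer arithmetic.
hours = minutes.map(lambda m: "{:02d}:{:02d}".format(*divmod(int(m), 60)))

print(hours.head(4))
```

Unlike grouping on Start Time alone, this splits a task that crosses a slot boundary, which is why the 1:17 to 1:31 task shows up as 13 minutes in the slot ending 01:30 and 1 minute in the slot ending 02:00.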
