I need to subset a dataframe (df1) that has measurements (temp) recorded every 5 minutes, with a datetime as the index.
Dataframe df2 contains data on when there has been an event: 0 marks the start of an event and 1 marks the end. df2 has a column called date, which holds the datetime of the start and end of the respective event. The starts and ends of all events are recorded to the nearest second.
I want to subset df1 based on the times when there has been an event, using the same datetime format as contained in df1 (temp every 5 minutes).
In the example below, there has been an event between 00:07:00 and 00:14:00, so I would like df3 to contain df1['temp'] at 00:05:00 and 00:10:00. There has also been an event between 00:41:00 and 00:44:00, so I would also like df3 to contain 00:40:00.
import numpy as np
import pandas as pd
df1 = pd.DataFrame({'temp' : [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13]},
                   index=pd.date_range('2019-05-02T00:00:00', '2019-05-02T01:00:00', freq='5T'))
df2 = pd.DataFrame({'event' : [0, 1, 0, 1],
                    'date' : ['2019-05-02-00:07:00', '2019-05-02-00:14:00', '2019-05-02-00:41:00', '2019-05-02-00:44:00']})
df2['date'] = pd.to_datetime(df2['date'])
df3 = pd.DataFrame({'result' : [2, 3, 9],
                    'date' : ['2019-05-02-00:05:00', '2019-05-02-00:10:00', '2019-05-02-00:40:00']})
In my actual work, I have 7 separate df's that each contain different events, each of which I want to use to subset df1; I then want to combine the results, so I end up with a single df that is a subset of all the data in df1 for the times when there has been an event in any of the other 7 df's. df1, in reality, has 37 columns of data that I want transferred to the final df3. Once I have the code for the subsetting as above, I was going to merge all of the subset data and delete any duplicates.
You can do it by using resample and concat.
Since you have events which can span longer than two bins, you also need a custom resampling function (I found no way to do it better).
event_on = 0  # global state; reset to 0 before re-running

def event_tracker(x):
    global event_on
    if len(x) > 0:
        event_on += x.sum()  # starts (-1) open an event, ends (+1) close it
        return 1
    else:
        if event_on < 0:  # starts are mapped to -1, so a negative counter means an event is still open
            return 1
        else:
            return 0

idf2 = df2.set_index('date')
idf2.loc[idf2['event'] == 0, 'event'] = -1  # mark event starts as -1, ends stay +1
rbdf2 = idf2.resample('5T').apply(event_tracker)
concatenated = pd.concat([df1, rbdf2], axis=1)
df3 = concatenated.loc[concatenated['event'] > 0.0]
df3 = df3.drop('event', axis=1)
Using your sample dataframes, this produces df3:
 temp
2019-05-02 00:05:00 2
2019-05-02 00:10:00 3
2019-05-02 00:40:00 9
Here the dates are set as the index; if for some reason you need them as a column, add a final line df3 = df3.reset_index().
Let me explain what I have done above, step by step:
- I defined a custom function event_tracker for the resampler. It's a bit dirty because it makes use of a global variable, but it's the quickest way I found to do it. Basically, the global variable is used to keep track of whether there is an event ongoing: the function returns 0 if the bin has no event ongoing, 1 otherwise. Then I can go line by line:
- Set 'date' as the index of idf2.
- Change each 0 (start of event) to -1. This is needed to correctly perform the math in event_tracker.
- Use resample. This function resamples a dataframe with a DatetimeIndex. I used a resampling of 5 minutes ('5T') to match the bins in df1 (print rbdf2 to see it and you will understand). .apply() is used to apply event_tracker to each bin and get a 0 or 1 as explained before.
- Use concat to concatenate the two dataframes.
- Keep the rows where event is > 0, which are the rows where an event is ongoing.
- Drop the 'event' column.
This approach works even if the df2 dates are not ordered.
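To make the intermediate rbdf2 concrete, here is a self-contained sketch of the whole pipeline. It writes the ongoing-event check as event_on < 0 (the start markers are -1, so a negative counter means an event is open) and uses '5min', the modern spelling of the '5T' alias:

```python
import pandas as pd

df1 = pd.DataFrame({'temp': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13]},
                   index=pd.date_range('2019-05-02T00:00:00', '2019-05-02T01:00:00', freq='5min'))
df2 = pd.DataFrame({'event': [0, 1, 0, 1],
                    'date': pd.to_datetime(['2019-05-02 00:07:00', '2019-05-02 00:14:00',
                                            '2019-05-02 00:41:00', '2019-05-02 00:44:00'])})

event_on = 0  # global state; negative while an event is ongoing

def event_tracker(x):
    global event_on
    if len(x) > 0:
        event_on += x.sum()   # starts (-1) open an event, ends (+1) close it
        return 1
    elif event_on < 0:        # an earlier bin opened an event that has not ended yet
        return 1
    return 0

idf2 = df2.set_index('date')
idf2.loc[idf2['event'] == 0, 'event'] = -1      # mark event starts as -1, ends stay +1
rbdf2 = idf2.resample('5min').apply(event_tracker)
print(rbdf2)  # 1 for the 00:05, 00:10 and 00:40 bins, 0 for the bins in between

concatenated = pd.concat([df1, rbdf2], axis=1)
df3 = concatenated.loc[concatenated['event'] > 0.0].drop('event', axis=1)
print(df3)    # temps 2, 3 and 9
```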
Since you have 7 df2s, you need to concatenate them before using the above procedure. Simply do:
df2 = pd.concat([df21, df22])
where df21 and df22 are two dataframes with the same structure as your df2. If you have seven dataframes, the list given to concat has to contain all seven dataframes: [df21, df22, df23, ...].
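As a minimal sketch of that concatenation step (df21 and df22 here are hypothetical stand-ins for two of the seven event dataframes):

```python
import pandas as pd

# hypothetical stand-ins for two of the seven event dataframes
df21 = pd.DataFrame({'event': [0, 1],
                     'date': pd.to_datetime(['2019-05-02 00:07:00', '2019-05-02 00:14:00'])})
df22 = pd.DataFrame({'event': [0, 1],
                     'date': pd.to_datetime(['2019-05-02 00:41:00', '2019-05-02 00:44:00'])})

# one combined event dataframe to feed into the procedure above;
# ignore_index=True avoids duplicated row labels after concatenation
df2 = pd.concat([df21, df22], ignore_index=True)
print(df2)
```

If you instead subset df1 once per event dataframe, you can combine the pieces with pd.concat and drop the repeated rows with df3 = df3[~df3.index.duplicated()].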
Continuing your given minimal example:
# create from df2 a data frame with a 'from' and 'to' column (range data frame)
def df2_like_to_from_to(df2, date_col='date'):
    """
    Takes the 'event' column, collects all '0' and all '1' event rows
    and concatenates their 'date' columns into a data frame.
    It preserves all other columns from the first ('0') data frame.
    (That is why the code is a little more complicated.)
    It renames the date_col column to 'from' and 'to' and puts them up front.
    """
    df_from = df2[df2.event == 0]
    df_to = df2[df2.event == 1]
    col_names = [x if x != date_col else 'from' for x in df2.columns]
    df_from_to = pd.concat([df_from.reset_index(), df_to.loc[:, 'date'].reset_index()], axis=1)
    df_from_to = df_from_to.drop(columns=['index'])
    df_from_to.columns = col_names + ['to']
    df_res = df_from_to.loc[:, ['from', 'to'] + [x for x in col_names if x != 'from']]
    return df_res
range_df = df2_like_to_from_to(df2)
#
# from to event
# 0 2019-05-02 00:07:00 2019-05-02 00:14:00 0
# 1 2019-05-02 00:41:00 2019-05-02 00:44:00 0
#
# filter df1 by its dates overlapping with the ranges in the range data frame
def filter_by_overlap(dates, df, df_from_to, from_to_col=['from', 'to']):
    """
    Filters df rows by overlaps of the given dates (one per row) with a data frame
    which contains ranges (column names given by 'from_to_col' - first the 'from' and second the 'to' values).
    The dates are used to build pseudo-intervals which are then searched for
    any overlap with the ranges in the ranges data frame.
    df is subsetted to the overlapping rows and returned.
    """
    ranges_from_to = df_from_to.loc[:, from_to_col].apply(lambda x: pd.Interval(*x), axis=1)
    ranges_date = [pd.Interval(x, x) for x in dates]  # pseudo range for data points
    selector = [any(x.overlaps(y) for y in ranges_from_to) for x in ranges_date]
    return df.loc[selector, :]
filter_by_overlap(df1.index, df1, range_df)
# first argument: the data list/column for which overlaps should be searched
# second argument: the to-be-filtered data frame
# third argument: the range data frame which should select the dates (first argument)
# output:
# temp
# 2019-05-02 00:10:00 3
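Note that the point pseudo-intervals above only select timestamps strictly inside an event, which is why only the 00:10:00 row survives. If you want every 5-minute bin that merely overlaps an event (the df3 the question asks for, with rows at 00:05, 00:10 and 00:40), one option is to build an interval per measurement bin instead of a degenerate point interval. A self-contained sketch of that variant:

```python
import pandas as pd

df1 = pd.DataFrame({'temp': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13]},
                   index=pd.date_range('2019-05-02T00:00:00', '2019-05-02T01:00:00', freq='5min'))
# event ranges as intervals (the 'from'/'to' pairs of the range data frame)
events = [pd.Interval(pd.Timestamp('2019-05-02 00:07:00'), pd.Timestamp('2019-05-02 00:14:00')),
          pd.Interval(pd.Timestamp('2019-05-02 00:41:00'), pd.Timestamp('2019-05-02 00:44:00'))]

step = pd.Timedelta('5min')
# one half-open interval per 5-minute measurement bin instead of a point interval
bins = [pd.Interval(t, t + step, closed='left') for t in df1.index]
selector = [any(b.overlaps(e) for e in events) for b in bins]
df3 = df1.loc[selector]
print(df3)  # rows 00:05, 00:10 and 00:40
```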