I have a DataFrame with a DatetimeIndex and multiple columns, as shown:
                     data_1  data_2
time
2020-01-01 00:23:40  330.98     NaN
2020-01-01 00:23:50  734.52     NaN
2020-01-03 00:00:00  388.06    23.9
2020-01-03 00:00:10  341.60    25.1
2020-01-03 00:00:20  395.14    24.9
...
2020-01-03 00:01:10  341.60    25.1
2020-01-03 00:01:20  395.14    24.9
I want to apply a function over a rolling window (it has to be a time-based window, since I may have missing data, and this one is not my case) and collect some features. The features depend on multiple columns. I wrote my own class:
class FeatureCollector:
    def __init__(self):
        self.feature_dicts = []

    def collect(self, window):
        self.feature_dicts.append(extract_features(window))
        return 1


def extract_features(window):
    ans = {}
    # do_smth_on_window and calculate ans
    return ans
I run the rolling operation as follows:
collector = FeatureCollector()
my_df.rolling(timedelta(seconds=100), min_periods=10).apply(collector.collect)  # timedelta from the datetime module
features = collector.feature_dicts
The problem is that, as I understand it, extract_features only ever receives a Series. My columns data_1 and data_2 are passed to it one after the other, as if the DataFrame looked like this:
                       data
time
2020-01-01 00:23:40  330.98
2020-01-01 00:23:50  734.52
2020-01-03 00:00:00  388.06
2020-01-03 00:00:10  341.60
2020-01-03 00:00:20  395.14
...
2020-01-03 00:01:10  341.60
2020-01-03 00:01:20  395.14
2020-01-01 00:23:40     NaN
2020-01-01 00:23:50     NaN
2020-01-03 00:00:00    23.9
2020-01-03 00:00:10    25.1
2020-01-03 00:00:20    24.9
...
2020-01-03 00:01:10    25.1
2020-01-03 00:01:20    24.9
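For what it's worth, the column-by-column behaviour can be reproduced with a small sketch. The sample values and the probe function are made up for illustration; only rolling(...).apply(...) itself comes from the code above.

```python
import pandas as pd

# Hypothetical sample data, shaped like the frame shown earlier.
index = pd.date_range("2020-01-03 00:00:00", periods=6, freq="10s")
my_df = pd.DataFrame(
    {
        "data_1": [388.06, 341.60, 395.14, 341.60, 395.14, 388.06],
        "data_2": [23.9, 25.1, 24.9, 25.1, 24.9, 23.9],
    },
    index=index,
)

seen = []  # type name of every window object handed to the callback


def probe(window):
    # Record what kind of object rolling().apply() passes in.
    seen.append(type(window).__name__)
    return 1.0  # rolling().apply() expects a numeric return value


my_df.rolling("30s").apply(probe, raw=False)

print(set(seen))                # only Series objects, never a DataFrame
print(len(seen), len(my_df))    # one call per row and per column
```

Every call receives a one-column Series, and the number of calls is the number of rows times the number of columns, so the callback never sees data_1 and data_2 together.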
How can I organize it in such a way that one window passed to extract_features would be a DataFrame with two columns?
I had this same problem and built the following solution (this solution actually uses grouping as well, but you can adapt it to meet your needs).
This solution works as expected in one of my environments, but in another environment the rolling() operation switches the index of each window to the "on" column, which breaks it.
The function returns a sorted copy of your original DataFrame, with an added uid column, together with a list of tuples (my_windows). The tuples have the form (group, [uids]), and you can recover the windows later using
for w in windows:
    indexes = w[1]
    this_window = df_copy.iloc[indexes]  # uids are row positions in the sorted copy
Then you can do whatever you want with each window, which will have all the columns.
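As a rough sketch of that last step, the recovered windows can be turned into one row of features each. The frame, the windows list and the extract_features body below are hypothetical stand-ins, not output of the function itself; the rows are selected through the uid column so the lookup does not depend on the frame's index.

```python
import pandas as pd

# Hypothetical stand-ins for the two return values described above:
# a sorted frame with a 'uid' column and a list of (group, [uids]) tuples.
df_copy = pd.DataFrame(
    {
        "group": ["a", "a", "a", "b", "b"],
        "data_1": [388.06, 341.60, 395.14, 330.98, 734.52],
        "data_2": [23.9, 25.1, 24.9, 20.0, 21.0],
    }
)
df_copy["uid"] = range(len(df_copy))
windows = [("a", [0, 1, 2]), ("b", [3, 4])]


def extract_features(window: pd.DataFrame) -> dict:
    # Placeholder features that need both columns at once.
    return {
        "mean_data_1": window["data_1"].mean(),
        "max_ratio": (window["data_1"] / window["data_2"]).max(),
    }


rows = []
for group, uids in windows:
    # Select by the 'uid' column rather than by index label.
    this_window = df_copy[df_copy["uid"].isin(uids)]
    rows.append({"group": group, "n_rows": len(this_window), **extract_features(this_window)})

features = pd.DataFrame(rows)
print(features)
```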
import pandas as pd


def rolling_grouped_windows(df: pd.DataFrame, time_offset: str, grouping_field_name: str, time_field_name: str,
                            prune_subsets=True) -> tuple[pd.DataFrame, list[tuple]]:

    # Innermost function: stores the uids of each window, annotated with the group id.
    def assign_windows(ser: pd.Series, my_df: pd.DataFrame, my_windows: list[tuple], group_id):
        my_uids = list(my_df.loc[ser.index, 'uid'].values)
        # pandas' rolling implementation executes assign_windows() on each column, so we
        # restrict the action to a single column to avoid duplicating windows.
        if -1 in ser.values:
            my_windows.append((group_id, my_uids))
        return 1  # Dummy return value, because rolling().apply() expects a numerical result.

    # Middle function: takes a group and passes it through rolling() after resetting the index.
    def perform_rolling(df: pd.DataFrame, my_windows: list[tuple]):
        group_id = df[grouping_field_name].unique()[0]
        dfc = df.reset_index(drop=True)
        dfc.drop([grouping_field_name], inplace=True, axis=1)
        dfc.rolling(time_offset, on=time_field_name).apply(
            assign_windows, kwargs={'my_df': dfc, 'my_windows': my_windows, 'group_id': group_id})

    # Check whether window_1 is a subset of window_2 (same group), in which case it should not be returned.
    def check_subset(window_1, window_2):
        group_1, rows_1 = window_1
        group_2, rows_2 = window_2
        if group_1 != group_2:
            return False
        if set(rows_1) <= set(rows_2):
            return True
        return False

    df_copy = df.copy()
    df_copy.sort_values([grouping_field_name, time_field_name], inplace=True)
    df_copy['uid'] = list(range(len(df_copy)))  # Adds a primary key for later window creation.
    my_windows = []
    nub = df_copy[[grouping_field_name, time_field_name, 'uid']].copy()
    nub['report_'] = -1  # Dummy column, so that exactly one series is used to grab the indexes.
    nub.groupby(grouping_field_name).apply(perform_rolling, my_windows=my_windows)
    if prune_subsets:
        # Remove windows that are subsets of a neighbouring window.
        pruned_windows = []
        for n, (group, rows) in enumerate(my_windows):
            if n > 0:
                if check_subset(my_windows[n], my_windows[n - 1]):
                    continue
            if n < (len(my_windows) - 1):
                if check_subset(my_windows[n], my_windows[n + 1]):
                    continue
            pruned_windows.append((group, rows))
        return df_copy, pruned_windows
    return df_copy, my_windows
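If the index handed to the callback differs between environments, one possible variant is to avoid looking at that index at all: roll over a single helper Series of row positions with raw=True, so the callback receives the positions as a plain array and slices the full DataFrame itself. This is only a sketch for the ungrouped case from the question; the sample data, the positions Series and the extract_features body are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with a gap, similar to the question's frame.
index = pd.to_datetime(
    [
        "2020-01-01 00:23:40", "2020-01-01 00:23:50",
        "2020-01-03 00:00:00", "2020-01-03 00:00:10", "2020-01-03 00:00:20",
    ]
)
my_df = pd.DataFrame(
    {
        "data_1": [330.98, 734.52, 388.06, 341.60, 395.14],
        "data_2": [np.nan, np.nan, 23.9, 25.1, 24.9],
    },
    index=index,
)


def extract_features(window: pd.DataFrame) -> dict:
    # Placeholder: any computation that needs several columns together.
    return {
        "end": window.index[-1],
        "n_rows": len(window),
        "mean_data_1": window["data_1"].mean(),
        "mean_data_2": window["data_2"].mean(),
    }


feature_dicts = []


def collect(window_positions: np.ndarray) -> float:
    # With raw=True the callback gets a float ndarray holding row positions.
    window = my_df.iloc[window_positions.astype(int)]
    feature_dicts.append(extract_features(window))
    return 1.0  # dummy value; rolling().apply() requires a number


# Roll over one helper Series of row positions, not over the data columns.
positions = pd.Series(np.arange(len(my_df), dtype="float64"), index=my_df.index)
positions.rolling("100s", min_periods=2).apply(collect, raw=True)

features = pd.DataFrame(feature_dicts)
print(features)
```

Because the helper Series has no missing values, min_periods here counts rows in the window rather than non-NaN values of any data column. Newer pandas versions (1.1 and later, as far as I know) also allow iterating over the rolling object directly, e.g. for window in my_df.rolling("100s"), which yields DataFrame windows with all columns; min_periods does not appear to filter windows in that case, so short windows would have to be skipped by hand.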