
Pandas rolling window on multiple columns with datetime index

My DataFrame has a DatetimeIndex and multiple columns, as shown:

                     data_1   data_2
time                                      
2020-01-01 00:23:40  330.98      NaN
2020-01-01 00:23:50  734.52      NaN
2020-01-03 00:00:00  388.06     23.9
2020-01-03 00:00:10  341.60     25.1
2020-01-03 00:00:20  395.14     24.9
...
2020-01-03 00:01:10  341.60     25.1
2020-01-03 00:01:20  395.14     24.9

I want to apply a function over a rolling window and collect some features. The window has to be time-based, since some rows may be missing, so this one does not cover my case. The features depend on multiple columns, so I wrote my own class:

class FeatureCollector:
    def __init__(self):
        self.feature_dicts = []

    def collect(self, window):
        self.feature_dicts.append(extract_features(window))
        return 1

def extract_features(window):
    ans = {}
    # do_smth_on_window and calculate ans
    return ans

I run the rolling computation as follows:

from datetime import timedelta

collector = FeatureCollector()
my_df.rolling(timedelta(seconds=100), min_periods=10).apply(collector.collect)
features = collector.feature_dicts

The problem is that, as far as I understand, extract_features only ever receives a Series. The columns data_1 and data_2 are passed to it one after the other, as if the DataFrame looked like this:

                       data
time                                      
2020-01-01 00:23:40  330.98
2020-01-01 00:23:50  734.52
2020-01-03 00:00:00  388.06
2020-01-03 00:00:10  341.60
2020-01-03 00:00:20  395.14
...
2020-01-03 00:01:10  341.60
2020-01-03 00:01:20  395.14                                 
2020-01-01 00:23:40     NaN
2020-01-01 00:23:50     NaN
2020-01-03 00:00:00    23.9
2020-01-03 00:00:10    25.1
2020-01-03 00:00:20    24.9
...
2020-01-03 00:01:10    25.1
2020-01-03 00:01:20    24.9
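The behaviour described above can be made visible with a minimal sketch. The sample frame, the `seen` list, and the `spy` callback are hypothetical names used only for illustration; the sketch assumes `raw=False`, which is the default for `apply`.

```python
import pandas as pd

# Hypothetical sample data: three rows, two columns, 10-second spacing.
index = pd.to_datetime(
    ["2020-01-03 00:00:00", "2020-01-03 00:00:10", "2020-01-03 00:00:20"]
)
df = pd.DataFrame(
    {"data_1": [388.06, 341.60, 395.14], "data_2": [23.9, 25.1, 24.9]},
    index=index,
)

seen = []  # every object that rolling().apply() hands to the callback


def spy(window):
    seen.append(window)
    return 1.0  # apply() requires a numeric return value


df.rolling("15s").apply(spy, raw=False)

# Each call receives a one-column Series, never a two-column DataFrame.
print({type(w).__name__ for w in seen})
print(len(seen))
```

The callback runs once per window and per column, so any feature that needs data_1 and data_2 together cannot be computed inside it directly.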

How can I organize it in such a way that one window passed to extract_features would be a DataFrame with two columns?

I had this same problem and built the following solution (this solution actually uses grouping as well, but you can adapt it to meet your needs).

This solution works as expected in one of my environments, but in another environment rolling() switches the index to the "on" column, and the solution no longer works there.

The function returns a sorted copy of your original DataFrame and a list of tuples (my_windows). Each tuple has the form (group, [indexes]), so you can recover the windows later using:

for w in windows:
    indexes = w[1]
    # 'uid' values are row positions in the sorted copy, so select by position
    this_window = df_copy.iloc[indexes]

Then you can do whatever you want with your window, which will have all the columns.

import pandas as pd

def rolling_grouped_windows(df: pd.DataFrame, time_offset: str, grouping_field_name: str, time_field_name: str,
    prune_subsets=True) -> tuple[pd.DataFrame, list[tuple]]:
    # innermost function stores the indexes of each window, annotating the group index.
    def assign_windows(ser: pd.Series, my_df: pd.DataFrame, my_windows: list[tuple], group_id):
        my_uids = list(my_df.loc[ser.index, 'uid'].values)
        # pandas' rolling implementation will execute assign_windows() on each column, so we
        # restrict action to a single column to avoid duplicating windows.
        if -1 in ser.values:
            my_windows.append((group_id, my_uids))
        return 1 # This is a dummy return because pd.DataFrame.rolling expects numerical return values.
    
    # middle function takes group and passes it through the rolling function after resetting the index
    def perform_rolling(df: pd.DataFrame, my_windows: list[tuple]):
        group_id = df[grouping_field_name].unique()[0]
        dfc = df.reset_index(drop=True)
        dfc.drop([grouping_field_name], inplace=True, axis=1)
        dfc.rolling(time_offset, on=time_field_name).apply(assign_windows, kwargs={'my_df':dfc, 'my_windows':my_windows, 
            'group_id':group_id})
    
    # Check one window against another to see if it is a subset and should not be returned
    def check_subset(window_1, window_2):
        group_1, rows_1 = window_1
        group_2, rows_2 = window_2
        if group_1 != group_2:
            return False
        if set(rows_1) <= set(rows_2):
            return True
        return False
    df_copy = df.copy()
    df_copy.sort_values([grouping_field_name, time_field_name], inplace=True)
    df_copy['uid'] = list(range(len(df_copy)))  #adds primary key for later window creation.
    my_windows = []
    nub = df_copy[[grouping_field_name, time_field_name, 'uid']].copy()
    nub['report_'] = -1 # dummy column used so we select exactly one series to grab indexes from.
    nub.groupby(grouping_field_name).apply(perform_rolling, my_windows=my_windows)
    if prune_subsets:
        # Remove windows that are subsets of a neighbouring window.
        pruned_windows = []
        for n, (group, rows) in enumerate(my_windows):
            if n > 0 :
                if check_subset(my_windows[n], my_windows[n - 1]):
                    continue
            if n < (len(my_windows) - 1):
                if check_subset(my_windows[n], my_windows[n + 1]):
                    continue
            pruned_windows.append((group, rows))
        return df_copy, pruned_windows
    return df_copy, my_windows
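As an alternative to the function above (a different technique, not the approach this answer uses), Rolling objects can be iterated directly. The sketch below assumes pandas 1.1 or later, a sorted DatetimeIndex, and no grouping. The sample data and the features inside `extract_features` are hypothetical placeholders. Iteration yields every window regardless of `min_periods`, so the minimum size is checked by hand here, by row count rather than by non-NaN observations.

```python
import pandas as pd

# Hypothetical sample data with a DatetimeIndex and two columns.
index = pd.to_datetime(
    ["2020-01-03 00:00:00", "2020-01-03 00:00:10", "2020-01-03 00:00:20"]
)
df = pd.DataFrame(
    {"data_1": [388.06, 341.60, 395.14], "data_2": [23.9, 25.1, 24.9]},
    index=index,
)


def extract_features(window: pd.DataFrame) -> dict:
    # Placeholder features; any computation mixing both columns goes here.
    return {
        "end": window.index[-1],
        "n_rows": len(window),
        "mean_ratio": (window["data_1"] / window["data_2"]).mean(),
    }


min_rows = 2  # applied by hand, because iteration yields every window

# Each `window` is a DataFrame slice that keeps all columns.
features = [
    extract_features(window)
    for window in df.rolling("15s")
    if len(window) >= min_rows
]

print(pd.DataFrame(features))
```

Each dictionary in `features` then describes one time window, and the list can be turned into a DataFrame as shown in the last line.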


 