Pandas rolling window over multiple columns with a datetime index

Question

My DataFrame has a datetime index and multiple columns. It looks like this:

                     data_1   data_2
time                                      
2020-01-01 00:23:40  330.98      NaN
2020-01-01 00:23:50  734.52      NaN
2020-01-03 00:00:00  388.06     23.9
2020-01-03 00:00:10  341.60     25.1
2020-01-03 00:00:20  395.14     24.9
...
2020-01-03 00:01:10  341.60     25.1
2020-01-03 00:01:20  395.14     24.9

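I want to apply a function over a rolling window (the window has to be datetime-based, because I may have missing data, so a fixed row count does not fit my case) and collect some features. The features depend on multiple columns. I wrote my own class: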

class FeatureCollector:
    def __init__(self):
        self.feature_dicts = []

    def collect(self, window):
        self.feature_dicts.append(extract_features(window))
        return 1

def extract_features(window):
    ans = {}
    # do_smth_on_window and calculate ans
    return ans

I run my rolling like this:

collector = FeatureCollector()
my_df.rolling(timedelta(seconds=100), min_periods=10).apply(collector.collect)
features = collector.feature_dicts

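The problem is that, as far as I understand, extract_features only ever receives a Series object. My data_1 and data_2 columns are passed to it one after the other, as if the DataFrame looked like this: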

                       data
time                                      
2020-01-01 00:23:40  330.98
2020-01-01 00:23:50  734.52
2020-01-03 00:00:00  388.06
2020-01-03 00:00:10  341.60
2020-01-03 00:00:20  395.14
...
2020-01-03 00:01:10  341.60
2020-01-03 00:01:20  395.14                                 
2020-01-01 00:23:40     NaN
2020-01-01 00:23:50     NaN
2020-01-03 00:00:00    23.9
2020-01-03 00:00:10    25.1
2020-01-03 00:00:20    24.9
...
2020-01-03 00:01:10    25.1
2020-01-03 00:01:20    24.9

How can I organize this so that each window passed to extract_features is a DataFrame with both columns?
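The per-column behavior described above can be checked with a small probe. This is only a sketch on made-up data: the `probe` function, the `seen` list, and the sample values are hypothetical and not part of the original code.

```python
import pandas as pd

# Small synthetic frame with a DatetimeIndex and two columns (made-up values).
idx = pd.date_range("2020-01-03", periods=4, freq="10s")
df = pd.DataFrame(
    {"data_1": [1.0, 2.0, 3.0, 4.0], "data_2": [10.0, 20.0, 30.0, 40.0]},
    index=idx,
)

seen = []  # one (type name, window length) entry per call to probe


def probe(window):
    # Record what rolling().apply() actually hands to the function.
    seen.append((type(window).__name__, len(window)))
    return 1.0  # apply() expects a numeric return value


df.rolling("20s").apply(probe)

# Every call receives a Series, and the function runs once per row per column.
print(set(t for t, _ in seen), len(seen))
```

On this frame the function is called for each row of each column separately, which matches the stacked layout shown above.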

Answer 1

I ran into the same problem and built the following solution (it actually also uses grouping, but you can adapt it to your needs).

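This solution works as expected in one of my environments, but in another one the rolling() operation switches the axis to the "on" column, so it no longer works there.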

The output consists of a copy of your original dataframe and the list of tuples in my_windows, each of the form (group, [indexes]), which you can then use later like this:

for w in windows:
    indexes = w[1]
    this_window = df_copy.loc[indexes]

You can then do whatever you want with the window, and it will contain all the columns.

import pandas as pd

def rolling_grouped_windows(df: pd.DataFrame, time_offset: str, grouping_field_name: str, time_field_name: str,
    prune_subsets=True) -> tuple[pd.DataFrame, list[tuple]]:
    # innermost function stores the indexes of each window, annotating the group index.
    def assign_windows(ser: pd.Series, my_df: pd.DataFrame, my_windows: list[tuple], group_id):
        # Map the rows of this window back to the 'uid' primary key.
        my_uids = list(my_df.loc[ser.index, 'uid'].values)
        # pandas' rolling implementation will execute assign_windows() on each column, so we
        # restrict the action to a single column to avoid duplicating windows.
        if -1 in ser.values:
            my_windows.append((group_id, my_uids))
        return 1 # This is a dummy return because pd.DataFrame.rolling expects numerical return values.
    
    # middle function takes group and passes it through the rolling function after resetting the index
    def perform_rolling(df: pd.DataFrame, my_windows: list[tuple]):
        group_id = df[grouping_field_name].unique()[0]
        dfc = df.reset_index(drop=True)
        dfc.drop([grouping_field_name], inplace=True, axis=1)
        dfc.rolling(time_offset, on=time_field_name).apply(assign_windows, kwargs={'my_df':dfc, 'my_windows':my_windows, 
            'group_id':group_id})
    
    # Check one window against another to see if it is a subset and should not be returned
    def check_subset(window_1, window_2):
        group_1, rows_1 = window_1
        group_2, rows_2 = window_2
        if group_1 != group_2:
            return False
        if set(rows_1) <= set(rows_2):
            return True
        return False
    df_copy = df.copy()
    df_copy.sort_values([grouping_field_name, time_field_name], inplace=True)
    df_copy['uid'] = list(range(len(df_copy)))  #adds primary key for later window creation.
    my_windows = []
    nub = df_copy[[grouping_field_name, time_field_name, 'uid']].copy()
    nub['report_'] = -1 # dummy column used so we select exactly one series to grab indexes from.
    nub.groupby(grouping_field_name).apply(perform_rolling, my_windows=my_windows)
    if prune_subsets:
        # need to remove windows that are subsets of a neighboring window.
        pruned_windows = []
        for n, (group, rows) in enumerate(my_windows):
            if n > 0 :
                if check_subset(my_windows[n], my_windows[n - 1]):
                    continue
            if n < (len(my_windows) - 1):
                if check_subset(my_windows[n], my_windows[n + 1]):
                    continue
            pruned_windows.append((group, rows))
        return df_copy, pruned_windows
    return df_copy, my_windows
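
As a possible alternative to the function above, and assuming a reasonably recent pandas (I believe Rolling objects became iterable in pandas 1.1), you can iterate over the rolling object directly. Each item is then a DataFrame slice with all columns. The sketch below uses made-up data and a hypothetical extract_features. As far as I can tell min_periods is not applied when iterating, so the example filters on the window length itself.

```python
import pandas as pd


def extract_features(window: pd.DataFrame) -> dict:
    # Hypothetical features for illustration; each one may use several columns.
    return {
        "end": window.index[-1],
        "n_rows": len(window),
        "data_1_mean": window["data_1"].mean(),
        "data_2_max": window["data_2"].max(),
    }


# Made-up sample data with a DatetimeIndex and two columns.
idx = pd.date_range("2020-01-03", periods=6, freq="10s")
my_df = pd.DataFrame(
    {
        "data_1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
        "data_2": [10.0, 20.0, 30.0, 40.0, 50.0, 60.0],
    },
    index=idx,
)

min_rows = 3  # manual stand-in for min_periods

# Iterating the rolling object yields one DataFrame per row of my_df.
features = [
    extract_features(window)
    for window in my_df.rolling("30s")
    if len(window) >= min_rows
]

print(len(features))
```

This avoids the dummy return value and the per-column calls of rolling().apply(), at the cost of running the loop in plain Python.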

Solution 1
0 2022-06-11 17:37:23
