Pandas rolling window on multiple columns with datetime index
I have a DataFrame with a datetime index and multiple columns, as shown below:
data_1 data_2
time
2020-01-01 00:23:40 330.98 NaN
2020-01-01 00:23:50 734.52 NaN
2020-01-03 00:00:00 388.06 23.9
2020-01-03 00:00:10 341.60 25.1
2020-01-03 00:00:20 395.14 24.9
...
2020-01-03 00:01:10 341.60 25.1
2020-01-03 00:01:20 395.14 24.9
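I want to apply a function over a rolling window (the window has to be time-based, because some data points may be missing) and collect some features. The features depend on multiple columns. I wrote my own class: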
class FeatureCollector:
    def __init__(self):
        self.feature_dicts = []

    def collect(self, window):
        self.feature_dicts.append(extract_features(window))
        return 1

def extract_features(window):
    ans = {}
    # do_smth_on_window and calculate ans
    return ans
I run my rolling computation as follows:
from datetime import timedelta

collector = FeatureCollector()
my_df.rolling(timedelta(seconds=100), min_periods=10).apply(collector.collect)
features = collector.feature_dicts
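The problem is that, as far as I understand, extract_features only ever receives a Series object. My data_1 and data_2 columns are passed to it one after the other, as if the DataFrame looked like this: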
data
time
2020-01-01 00:23:40 330.98
2020-01-01 00:23:50 734.52
2020-01-03 00:00:00 388.06
2020-01-03 00:00:10 341.60
2020-01-03 00:00:20 395.14
...
2020-01-03 00:01:10 341.60
2020-01-03 00:01:20 395.14
2020-01-01 00:23:40 NaN
2020-01-01 00:23:50 NaN
2020-01-03 00:00:00 23.9
2020-01-03 00:00:10 25.1
2020-01-03 00:00:20 24.9
...
2020-01-03 00:01:10 25.1
2020-01-03 00:01:20 24.9
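The column-by-column behaviour described above can be checked with a minimal sketch. The toy data and the `record` callback are made up for illustration and are not part of the original code; the callback only logs the type and length of whatever `rolling(...).apply` passes in.

```python
import pandas as pd

# Toy data (hypothetical): four rows, 10 seconds apart, two columns.
index = pd.to_datetime([
    "2020-01-03 00:00:00",
    "2020-01-03 00:00:10",
    "2020-01-03 00:00:20",
    "2020-01-03 00:00:30",
])
df = pd.DataFrame(
    {"data_1": [388.06, 341.60, 395.14, 341.60],
     "data_2": [23.9, 25.1, 24.9, 25.1]},
    index=index,
)

seen = []

def record(window):
    # Log what the callback receives; apply() expects a numeric return value.
    seen.append((type(window).__name__, len(window)))
    return 1.0

df.rolling("20s").apply(record)

# Every call received a one-dimensional Series, never a two-column DataFrame.
print(seen)
```

With the default `raw=False`, each call receives a Series holding one column's values for one window, so the callback cannot see both columns at once.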
How can I organize this so that each window passed to extract_features is a DataFrame with both columns?
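I ran into the same problem and built the following solution (it also uses grouping, but you can adapt it to your needs).
This solution works as expected in one of my environments, but in another the rolling() operation switches the axis to the "on" column, so it no longer works.
The output includes your original dataframe and, in my_windows, a list of tuples of the form (group, [indexes]), which you can use later like this: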
for w in windows:
    indexes = w[1]
    this_window = df_copy.loc[indexes]
You can then do whatever you like with the window, and it will contain all the columns.
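As a rough sketch of that last step, the loop below turns each (group, [indexes]) tuple into one row of features computed from both columns. The toy `df_copy`, the hand-written `windows` list, and the `extract_features` body are all hypothetical stand-ins. The sketch also assumes the index labels of `df_copy` match the stored indexes, which holds here because the frame uses the default integer index.

```python
import pandas as pd

# Hypothetical stand-ins for the function's two return values.
df_copy = pd.DataFrame({
    "group": ["a", "a", "a", "b", "b"],
    "data_1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "data_2": [10.0, 20.0, 30.0, 40.0, 50.0],
})
windows = [("a", [0, 1]), ("a", [1, 2]), ("b", [3, 4])]

def extract_features(window: pd.DataFrame) -> dict:
    # Illustrative features only; each one may use any column of the window.
    return {
        "mean_1": window["data_1"].mean(),
        "max_2": window["data_2"].max(),
        "n_rows": len(window),
    }

features = []
for group, indexes in windows:
    this_window = df_copy.loc[indexes]  # full-width DataFrame for this window
    features.append({"group": group, **extract_features(this_window)})

features_df = pd.DataFrame(features)
print(features_df)
```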
import pandas as pd


def rolling_grouped_windows(df: pd.DataFrame, time_offset: str, grouping_field_name: str, time_field_name: str,
                            prune_subsets=True) -> tuple[pd.DataFrame, list[tuple]]:

    # Innermost function stores the indexes of each window, annotating the group index.
    def assign_windows(ser: pd.Series, my_df: pd.DataFrame, my_windows: list[tuple], group_id):
        print(ser.index)  # debug output
        my_uids = list(my_df.loc[ser.index, 'uid'].values)
        # pandas' rolling implementation will execute assign_windows() on each column, so we
        # restrict action to a single column to avoid duplicating windows.
        if -1 in ser.values:
            my_windows.append((group_id, my_uids))
        return 1  # This is a dummy return because pd.DataFrame.rolling expects numerical return values.

    # Middle function takes a group and passes it through the rolling function after resetting the index.
    def perform_rolling(df: pd.DataFrame, my_windows: list[tuple]):
        group_id = df[grouping_field_name].unique()[0]
        dfc = df.reset_index(drop=True)
        dfc.drop([grouping_field_name], inplace=True, axis=1)
        dfc.rolling(time_offset, on=time_field_name).apply(assign_windows, kwargs={'my_df': dfc, 'my_windows': my_windows,
                                                                                   'group_id': group_id})

    # Check one window against another to see if it is a subset and should not be returned.
    def check_subset(window_1, window_2):
        group_1, rows_1 = window_1
        group_2, rows_2 = window_2
        if group_1 != group_2:
            return False
        if set(rows_1) <= set(rows_2):
            return True
        return False

    df_copy = df.copy()
    df_copy.sort_values([grouping_field_name, time_field_name], inplace=True)
    df_copy['uid'] = list(range(len(df_copy)))  # Adds a primary key for later window creation.
    my_windows = []
    nub = df_copy[[grouping_field_name, time_field_name, 'uid']].copy()
    nub['report_'] = -1  # Dummy column used so we select exactly one series to grab indexes from.
    nub.groupby(grouping_field_name).apply(perform_rolling, my_windows=my_windows)
    if prune_subsets:
        # Need to remove windows that are subsets of other (neighbouring) windows.
        pruned_windows = []
        for n, (group, rows) in enumerate(my_windows):
            if n > 0:
                if check_subset(my_windows[n], my_windows[n - 1]):
                    continue
            if n < (len(my_windows) - 1):
                if check_subset(my_windows[n], my_windows[n + 1]):
                    continue
            pruned_windows.append((group, rows))
        return df_copy, pruned_windows
    return df_copy, my_windows
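If grouping is not needed, a different and simpler technique than the function above may be enough. Assuming pandas 1.1 or later, a Rolling object can be iterated directly, and each iteration yields the window as a DataFrame with all columns. The sketch below uses toy data and a hypothetical `extract_features`. Iteration yields every window regardless of `min_periods`, so short windows are filtered by hand with a made-up `min_rows` threshold.

```python
import pandas as pd

# Toy data (hypothetical): five rows, 10 seconds apart, two columns.
index = pd.to_datetime([
    "2020-01-03 00:00:00",
    "2020-01-03 00:00:10",
    "2020-01-03 00:00:20",
    "2020-01-03 00:00:30",
    "2020-01-03 00:00:40",
])
df = pd.DataFrame(
    {"data_1": [388.06, 341.60, 395.14, 341.60, 395.14],
     "data_2": [23.9, 25.1, 24.9, 25.1, 24.9]},
    index=index,
)

def extract_features(window: pd.DataFrame) -> dict:
    # Illustrative features; the last one combines both columns.
    return {
        "window_end": window.index[-1],
        "n_rows": len(window),
        "ratio_mean": (window["data_1"] / window["data_2"]).mean(),
    }

min_rows = 3  # manual replacement for min_periods, which iteration does not enforce

feature_dicts = [
    extract_features(window)
    for window in df.rolling("30s")  # each window is a two-column DataFrame
    if len(window) >= min_rows
]

print(pd.DataFrame(feature_dicts))
```

This runs the feature extraction in plain Python, one window at a time, so it may be slow on large frames.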