How to filter a pandas data frame by subgroups with a condition on another column's value
I've been struggling to find a solution for this. Here is the problem.
I have a data frame of the form:
date        day_time  day_time_counter  area
2019-06-05  morning   1                 1
2019-06-05  morning   1                 2
2019-06-05  morning   1                 3

2019-06-05  morning   2                 1
2019-06-05  morning   2                 2
2019-06-05  morning   2                 3

2019-06-05  morning   3                 1
2019-06-05  morning   3                 3

2019-06-05  evening   1                 1
2019-06-05  evening   1                 2

2019-06-05  evening   2                 1
2019-06-05  evening   2                 2
2019-06-05  evening   2                 3
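Every combination of 'date', 'day_time' and 'day_time_counter' forms a subgroup (I separated them with blank lines to make them more visible). Each subgroup can have one, two or three 'area' values.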
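For anyone who wants to experiment, the table above can be rebuilt roughly as follows. This is only a sketch: the `date` column is kept as a plain string (an assumption, the original dtype is not stated), and `areas_per_subgroup` is a helper name introduced here to show how many distinct areas each subgroup has.

```python
import pandas as pd

# Rows copied from the table above: (day_time, day_time_counter, area).
rows = [
    ("morning", 1, 1), ("morning", 1, 2), ("morning", 1, 3),
    ("morning", 2, 1), ("morning", 2, 2), ("morning", 2, 3),
    ("morning", 3, 1), ("morning", 3, 3),
    ("evening", 1, 1), ("evening", 1, 2),
    ("evening", 2, 1), ("evening", 2, 2), ("evening", 2, 3),
]
df = pd.DataFrame(rows, columns=["day_time", "day_time_counter", "area"])
# Assumption: the date is stored as a string; use pd.to_datetime if needed.
df.insert(0, "date", "2019-06-05")

# Number of distinct 'area' values in each subgroup.
areas_per_subgroup = df.groupby(
    ["date", "day_time", "day_time_counter"]
)["area"].nunique()
print(areas_per_subgroup)
```

Only the subgroups with 3 distinct areas are candidates for the filter described in this question.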
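What I want is to filter the df so that for each 'date' and 'day_time' I get only one subgroup: the one with the largest 'day_time_counter' that also contains all 3 different 'area' values (1, 2, 3). In other words, the selected subgroup should contain 3 rows, one per 'area' value.
That is, after filtering the df above I should get this OUTPUT: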
date day_time day_time_counter area
2019-06-05 morning 2 1
2019-06-05 morning 2 2
2019-06-05 morning 2 3
2019-06-05 evening 2 1
2019-06-05 evening 2 2
2019-06-05 evening 2 3
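So far I have only managed to filter by taking the rows with the largest 'day_time_counter', but I don't know how to add the condition that the subgroup must be complete, with all 3 'area' values.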
df_new = df.sort_values('day_time_counter', ascending=False).drop_duplicates(['area', 'date', 'day_time'])
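To make the shortcoming of this attempt concrete, here is a small sketch run on the sample data from the question (the frame is rebuilt inline, with `date` as a string by assumption). Because `drop_duplicates` works per 'area', it picks the largest counter for each area separately, so the rows kept for one 'day_time' can come from different subgroups.

```python
import pandas as pd

# Sample data from the question: (day_time, day_time_counter, area).
rows = [
    ("morning", 1, 1), ("morning", 1, 2), ("morning", 1, 3),
    ("morning", 2, 1), ("morning", 2, 2), ("morning", 2, 3),
    ("morning", 3, 1), ("morning", 3, 3),
    ("evening", 1, 1), ("evening", 1, 2),
    ("evening", 2, 1), ("evening", 2, 2), ("evening", 2, 3),
]
df = pd.DataFrame(rows, columns=["day_time", "day_time_counter", "area"])
df.insert(0, "date", "2019-06-05")

# The attempt from the question, unchanged.
df_new = df.sort_values("day_time_counter", ascending=False).drop_duplicates(
    ["area", "date", "day_time"]
)

# For 'morning', areas 1 and 3 come from counter 3 while area 2 comes
# from counter 2, so the result mixes two different subgroups.
morning = df_new[df_new["day_time"] == "morning"].sort_values("area")
print(morning)
```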
Thanks a lot for your help!
The following will do what you need:
area_grp_cols = ["date", "day_time", "day_time_counter"]
counter_grp_cols = ["date", "day_time"]
result = (
    # count the rows (areas) in each subgroup
    df.assign(area_count=lambda df: df.groupby(area_grp_cols)['area']
                                      .transform("count"))
    # keep only the subgroups with 3 areas
    .loc[lambda df: df["area_count"] == 3]
    .drop(columns=["area_count"])
    # keep the rows whose counter is the largest within each date/day_time
    .loc[lambda df: df["day_time_counter"]
                    == df.groupby(counter_grp_cols)["day_time_counter"]
                         .transform("max")]
)
Output:
date day_time day_time_counter area
3 2019-06-05 morning 2 1
4 2019-06-05 morning 2 2
5 2019-06-05 morning 2 3
10 2019-06-05 evening 2 1
11 2019-06-05 evening 2 2
12 2019-06-05 evening 2 3
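One thing to keep in mind: `transform("count")` counts rows, not distinct values. On the data in the question that makes no difference, but if a subgroup could repeat an 'area', `transform("nunique")` may be the safer check. The sketch below compares the two on the question's data plus three extra rows that are purely hypothetical (an evening subgroup with counter 3 and areas 1, 1, 2); `select_latest_complete` is a helper name made up for this illustration.

```python
import pandas as pd

rows = [
    ("morning", 1, 1), ("morning", 1, 2), ("morning", 1, 3),
    ("morning", 2, 1), ("morning", 2, 2), ("morning", 2, 3),
    ("morning", 3, 1), ("morning", 3, 3),
    ("evening", 1, 1), ("evening", 1, 2),
    ("evening", 2, 1), ("evening", 2, 2), ("evening", 2, 3),
    # Hypothetical rows, not in the question: 3 rows but only 2 distinct areas.
    ("evening", 3, 1), ("evening", 3, 1), ("evening", 3, 2),
]
df = pd.DataFrame(rows, columns=["day_time", "day_time_counter", "area"])
df.insert(0, "date", "2019-06-05")

subgroup_cols = ["date", "day_time", "day_time_counter"]
slot_cols = ["date", "day_time"]


def select_latest_complete(frame, how):
    """Keep the highest-counter subgroup whose area size (per `how`) is 3."""
    size = frame.groupby(subgroup_cols)["area"].transform(how)
    complete = frame[size == 3]
    latest = complete.groupby(slot_cols)["day_time_counter"].transform("max")
    return complete[complete["day_time_counter"] == latest]


by_count = select_latest_complete(df, "count")      # row count per subgroup
by_nunique = select_latest_complete(df, "nunique")  # distinct areas per subgroup
print(by_count)
print(by_nunique)
```

With the hypothetical rows, the row-count version selects the evening subgroup with counter 3, while the distinct-value version still selects counter 2.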
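I think your expected output should be different (evening with a day_time_counter of 3), so I believe my code is correct:
Select the subgroups that contain all 3 areas: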
import pandas as pd

m = df.groupby(['date', 'day_time', 'day_time_counter']).area
new_df = []
for k, areas in m:
    # skip subgroups that do not contain all 3 distinct areas
    if len(set(areas)) != 3:
        continue
    new_df.append(df[(df.date == k[0]) & (df.day_time == k[1]) & (df.day_time_counter == k[2])])
new_df = pd.concat(new_df, join='outer')
Find the largest day_time_counter for each date and day_time:
g = new_df.groupby(['date', 'day_time'])
# new_df already contains only complete subgroups, so the extra
# g.filter(...) call (whose result was discarded) is not needed here
g = g.day_time_counter.max()
And put it all together:
itr = [df[(df.date == idx[0]) & (df.day_time == idx[1]) & (df.day_time_counter == value)] for idx, value in zip(g.index, g)]
new_df = pd.concat(itr, join='outer')
new_df
Let me know if this is what you wanted.
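The same two steps can probably be written more compactly without the explicit loop. This is a sketch of one alternative, not the answerer's own code: it uses `groupby(...).filter` to keep the complete subgroups and an inner `merge` to keep the largest counter per date and day_time. The frame is rebuilt from the question's data, with `date` as a string by assumption.

```python
import pandas as pd

rows = [
    ("morning", 1, 1), ("morning", 1, 2), ("morning", 1, 3),
    ("morning", 2, 1), ("morning", 2, 2), ("morning", 2, 3),
    ("morning", 3, 1), ("morning", 3, 3),
    ("evening", 1, 1), ("evening", 1, 2),
    ("evening", 2, 1), ("evening", 2, 2), ("evening", 2, 3),
]
df = pd.DataFrame(rows, columns=["day_time", "day_time_counter", "area"])
df.insert(0, "date", "2019-06-05")

subgroup_cols = ["date", "day_time", "day_time_counter"]
slot_cols = ["date", "day_time"]

# Step 1: keep only subgroups that contain 3 distinct areas.
complete = df.groupby(subgroup_cols).filter(lambda g: g["area"].nunique() == 3)

# Step 2: largest counter per (date, day_time) among the complete subgroups.
latest = complete.groupby(slot_cols, as_index=False)["day_time_counter"].max()

# Step 3: an inner merge keeps just the rows of those winning subgroups.
new_df = complete.merge(latest, on=slot_cols + ["day_time_counter"])
print(new_df)
```

Note that `merge` resets the index, so the original row labels are not preserved in this variant.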
IIUC:
df['group'] = df['area'].eq(1).cumsum()
df_out = df.groupby(['date','day_time','group'])[['area','day_time_counter']]\
           .agg({'area':lambda x: x.nunique()==3,'day_time_counter':'sum'})
df_out.loc[df_out['area'], 'day_time_counter']\
      .rank(ascending=False, method='dense').eq(1).loc[lambda x: x]\
      .to_frame()\
      .merge(df, on=['date','day_time','group'], suffixes=('_',''))[df.columns]
Output:
area date day_time day_time_counter group
0 1 2019-06-05 evening 2 5
1 2 2019-06-05 evening 2 5
2 3 2019-06-05 evening 2 5
3 1 2019-06-05 morning 2 2
4 2 2019-06-05 morning 2 2
5 3 2019-06-05 morning 2 2
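A caveat on the ranking step: `rank` here is applied to all subgroups at once rather than within each date and day_time. That works for the sample data, but it could drop a 'day_time' whose best complete subgroup has a lower counter than another one's. A per-group ranking with `groupby(...).rank` may be more robust. The sketch below uses a small hypothetical frame (not the question's data) in which evening's only complete subgroup has counter 1.

```python
import pandas as pd

# Hypothetical data: morning is complete for counters 1 and 2,
# evening is complete only for counter 1 (counter 2 lacks area 3).
rows = [
    ("morning", 1, 1), ("morning", 1, 2), ("morning", 1, 3),
    ("morning", 2, 1), ("morning", 2, 2), ("morning", 2, 3),
    ("evening", 1, 1), ("evening", 1, 2), ("evening", 1, 3),
    ("evening", 2, 1), ("evening", 2, 2),
]
df = pd.DataFrame(rows, columns=["day_time", "day_time_counter", "area"])
df.insert(0, "date", "2019-06-05")

subgroup_cols = ["date", "day_time", "day_time_counter"]
slot_cols = ["date", "day_time"]

# Keep subgroups with 3 distinct areas.
complete = df[df.groupby(subgroup_cols)["area"].transform("nunique") == 3]

# Rank counters within each (date, day_time); rank 1 is the largest counter.
rank = complete.groupby(slot_cols)["day_time_counter"].rank(
    method="dense", ascending=False
)
df_out = complete[rank == 1]
print(df_out)
```

Here morning keeps counter 2 and evening keeps counter 1, since each 'day_time' is ranked on its own.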