How to filter a pandas data frame by subgroups with a condition in another column value

I am struggling to find a solution. Here is the problem.

I have a dataframe of the form:

date         day_time   day_time_counter  area

2019-06-05   morning    1                 1
2019-06-05   morning    1                 2
2019-06-05   morning    1                 3

2019-06-05   morning    2                 1
2019-06-05   morning    2                 2
2019-06-05   morning    2                 3

2019-06-05   morning    3                 1
2019-06-05   morning    3                 3

2019-06-05   evening    1                 1
2019-06-05   evening    1                 2

2019-06-05   evening    2                 1
2019-06-05   evening    2                 2
2019-06-05   evening    2                 3

There are subgroups per "date", "day_time" and "day_time_counter" (I separated them with blank lines to make them more visible). Each subgroup can have one, two or three "area" values.
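To make the problem reproducible, here is a minimal sketch that rebuilds the table above and counts the distinct "area" values per subgroup. It assumes the date is stored as a plain string; the names `rows` and `areas_per_subgroup` are only illustrative.

```python
import pandas as pd

# Sample data copied from the table above: (day_time, day_time_counter, area).
rows = [
    ("morning", 1, 1), ("morning", 1, 2), ("morning", 1, 3),
    ("morning", 2, 1), ("morning", 2, 2), ("morning", 2, 3),
    ("morning", 3, 1), ("morning", 3, 3),
    ("evening", 1, 1), ("evening", 1, 2),
    ("evening", 2, 1), ("evening", 2, 2), ("evening", 2, 3),
]
df = pd.DataFrame(rows, columns=["day_time", "day_time_counter", "area"])
# Assumption: every row shares the same date, kept as a string here.
df.insert(0, "date", "2019-06-05")

# Number of distinct areas in each subgroup, in order of first appearance.
areas_per_subgroup = (
    df.groupby(["date", "day_time", "day_time_counter"], sort=False)["area"]
      .nunique()
)
print(areas_per_subgroup)
```

Only the subgroups whose count is 3 are "complete" in the sense used below.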

What I want is to filter the df so that, for each "date" and "day_time", I keep only the one subgroup that has the largest "day_time_counter" AND contains the 3 different "area" values (1, 2, 3), i.e. each selected subgroup should contain 3 rows, one per "area" value.

That is, after filtering the df above, I should get this output:

date         day_time   day_time_counter  area

2019-06-05   morning    2                 1
2019-06-05   morning    2                 2
2019-06-05   morning    2                 3

2019-06-05   evening    2                 1
2019-06-05   evening    2                 2
2019-06-05   evening    2                 3

So far I have only managed to filter by taking the rows with the largest "day_time_counter", but I do not know how to add the condition that the subgroup must be complete, with all 3 "area" values.

df_new = df.sort_values('day_time_counter', ascending=False).drop_duplicates(['area', 'date', 'day_time'])
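To see why this attempt is not enough, the sketch below runs it on the sample data (rebuilt here so the snippet is self-contained; the variable names are illustrative). `drop_duplicates` keeps the highest counter per individual "area", so the morning rows end up mixing counters 3 and 2 instead of forming one complete subgroup.

```python
import pandas as pd

# Sample data from the question: (day_time, day_time_counter, area).
rows = [
    ("morning", 1, 1), ("morning", 1, 2), ("morning", 1, 3),
    ("morning", 2, 1), ("morning", 2, 2), ("morning", 2, 3),
    ("morning", 3, 1), ("morning", 3, 3),
    ("evening", 1, 1), ("evening", 1, 2),
    ("evening", 2, 1), ("evening", 2, 2), ("evening", 2, 3),
]
df = pd.DataFrame(rows, columns=["day_time", "day_time_counter", "area"])
df.insert(0, "date", "2019-06-05")

# The attempt from the question: highest counter per (area, date, day_time).
attempt = (
    df.sort_values("day_time_counter", ascending=False)
      .drop_duplicates(["area", "date", "day_time"])
)

# Areas 1 and 3 come from counter 3, but area 2 only exists up to counter 2.
morning = attempt[attempt["day_time"] == "morning"].sort_values("area")
print(morning[["day_time_counter", "area"]])
```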

Thanks a lot for your help!

The following will produce what you're looking for:

area_grp_cols = ["date", "day_time", "day_time_counter"]
counter_grp_cols = ["date", "day_time"]
result = (
    # Count the rows in each subgroup and keep only subgroups with 3 rows.
    df.assign(area_count=lambda df: df.groupby(area_grp_cols)['area']
                                      .transform("count"))
      .loc[lambda df: df["area_count"] == 3]
      .drop(columns=["area_count"])
      # Among the remaining rows, keep the largest counter per date and day_time.
      .loc[lambda df: df["day_time_counter"]
                      == df.groupby(counter_grp_cols)["day_time_counter"]
                           .transform("max")]
)

Output:

          date day_time  day_time_counter  area
3   2019-06-05  morning                 2     1
4   2019-06-05  morning                 2     2
5   2019-06-05  morning                 2     3
10  2019-06-05  evening                 2     1
11  2019-06-05  evening                 2     2
12  2019-06-05  evening                 2     3
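The answer above counts rows per subgroup, which matches the sample because no "area" is repeated within a subgroup. As a hedged variant, the sketch below uses `transform("nunique")` so that a subgroup with repeated areas is not mistaken for a complete one. The function name `largest_complete_subgroups` and the `n_areas` parameter are hypothetical, introduced only for illustration.

```python
import pandas as pd

def largest_complete_subgroups(df, n_areas=3):
    """Keep, per (date, day_time), the highest-counter subgroup with n_areas distinct areas."""
    sub_cols = ["date", "day_time", "day_time_counter"]
    # Step 1: keep only subgroups that contain n_areas distinct area values.
    complete = df[df.groupby(sub_cols)["area"].transform("nunique") == n_areas]
    # Step 2: among those, keep the rows whose counter is the maximum per (date, day_time).
    max_counter = complete.groupby(["date", "day_time"])["day_time_counter"].transform("max")
    return complete[complete["day_time_counter"] == max_counter]

# Sample data from the question: (day_time, day_time_counter, area).
rows = [
    ("morning", 1, 1), ("morning", 1, 2), ("morning", 1, 3),
    ("morning", 2, 1), ("morning", 2, 2), ("morning", 2, 3),
    ("morning", 3, 1), ("morning", 3, 3),
    ("evening", 1, 1), ("evening", 1, 2),
    ("evening", 2, 1), ("evening", 2, 2), ("evening", 2, 3),
]
df = pd.DataFrame(rows, columns=["day_time", "day_time_counter", "area"])
df.insert(0, "date", "2019-06-05")

print(largest_complete_subgroups(df))
```

The order of the two steps matters: taking the maximum before dropping incomplete subgroups would select morning counter 3, which has only two areas.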

I think your expected output should be different (evening has day time 3), so I think my code is correct:

Choosing the subgroups that have all 3 areas:

import pandas as pd

# Keep only the subgroups that contain all 3 distinct areas.
grouped = df.groupby(['date', 'day_time', 'day_time_counter']).area
new_df = []
for key, areas in grouped:
    if len(set(areas)) != 3:
        continue
    new_df.append(df[(df.date == key[0]) & (df.day_time == key[1]) & (df.day_time_counter == key[2])])
new_df = pd.concat(new_df, join='outer')

Finding the max day_time_counter per date and day_time:

# new_df holds only complete subgroups, so this max is the largest complete counter.
g = new_df.groupby(['date', 'day_time']).day_time_counter.max()

And wrapping up:

itr = [df[(df.date == idx[0]) & (df.day_time == idx[1]) & (df.day_time_counter == value)] for idx, value in zip(g.index, g)]
new_df = pd.concat(itr, join='outer')
new_df

Tell me if this is what you wanted.
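The same two steps (drop incomplete subgroups, then take the largest counter) can also be written without an explicit loop. This is only a sketch of an alternative, not the answerer's code; it relies on `DataFrameGroupBy.filter` and a merge, and the names `complete`, `best` and `result` are illustrative.

```python
import pandas as pd

# Sample data from the question: (day_time, day_time_counter, area).
rows = [
    ("morning", 1, 1), ("morning", 1, 2), ("morning", 1, 3),
    ("morning", 2, 1), ("morning", 2, 2), ("morning", 2, 3),
    ("morning", 3, 1), ("morning", 3, 3),
    ("evening", 1, 1), ("evening", 1, 2),
    ("evening", 2, 1), ("evening", 2, 2), ("evening", 2, 3),
]
df = pd.DataFrame(rows, columns=["day_time", "day_time_counter", "area"])
df.insert(0, "date", "2019-06-05")

sub_cols = ["date", "day_time", "day_time_counter"]

# Step 1: drop every subgroup that does not contain 3 distinct areas.
complete = df.groupby(sub_cols).filter(lambda g: g["area"].nunique() == 3)

# Step 2: largest remaining counter per (date, day_time), as a small lookup table.
best = complete.groupby(["date", "day_time"], as_index=False)["day_time_counter"].max()

# Step 3: an inner merge keeps only the rows of the winning subgroups.
result = complete.merge(best, on=sub_cols)
print(result)
```

Note that the merge resets the index, so the row labels differ from those of the original frame.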

IIUC:

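# Label each subgroup; this assumes every subgroup starts with area 1.
df['group'] = df['area'].eq(1).cumsum()

# Per subgroup: does it contain 3 distinct areas, and what is its counter total?
df_out = df.groupby(['date','day_time','group'])[['area','day_time_counter']]\
           .agg({'area':lambda x: x.nunique()==3,'day_time_counter':'sum'})

# Among complete subgroups, keep the top-ranked one per date and day_time,
# then join the original rows back.
df_out.loc[df_out['area'], 'day_time_counter']\
      .groupby(level=['date','day_time'])\
      .rank(ascending=False, method='dense').eq(1).loc[lambda x: x]\
      .to_frame()\
      .merge(df, on=['date','day_time','group'], suffixes=('_',''))[df.columns]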

Output:

   area        date day_time  day_time_counter  group
0     1  2019-06-05  evening                 2      5
1     2  2019-06-05  evening                 2      5
2     3  2019-06-05  evening                 2      5
3     1  2019-06-05  morning                 2      2
4     2  2019-06-05  morning                 2      2
5     3  2019-06-05  morning                 2      2
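The `group` label in this answer comes from `df['area'].eq(1).cumsum()`, which starts a new group each time an area equal to 1 appears. The sketch below uses a small hypothetical frame (not the question's data) in which the second subgroup has no area 1, and compares that labelling with `groupby(...).ngroup()`, which labels subgroups directly from the key columns.

```python
import pandas as pd

# Hypothetical data: the second subgroup (counter 2) does not contain area 1.
df = pd.DataFrame({
    "date": ["2019-06-05"] * 5,
    "day_time": ["morning"] * 5,
    "day_time_counter": [1, 1, 1, 2, 2],
    "area": [1, 2, 3, 2, 3],
})

# Labelling used in the answer: a new group starts at every area == 1.
by_cumsum = df["area"].eq(1).cumsum()

# Alternative: label subgroups from the key columns themselves.
by_keys = df.groupby(["date", "day_time", "day_time_counter"], sort=False).ngroup()

print(by_cumsum.tolist())  # [1, 1, 1, 1, 1] -> both subgroups share one label
print(by_keys.tolist())    # [0, 0, 0, 1, 1] -> one label per subgroup
```

On the question's sample data every subgroup does start with area 1, so both labellings separate the subgroups there.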


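Notice: the technical posts on this site are licensed under CC BY-SA 4.0. If you repost them, please credit this site's URL or the original address. For any questions, contact: yoyou2525@163.com.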
