
Efficient grouping in pandas based on another Series

I need to perform a grouped operation that depends on another boolean column in my DataFrame. It is most easily seen in an example. I have the following DataFrame:

    b          id   
0   False      0
1   True       0
2   False      0
3   False      1
4   True       1
5   True       2
6   True       2
7   False      3
8   True       4
9   True       4
10  False      4

and would like to obtain a column whose elements are True if the b column is True and the row is the last one where b is True for a given id:

    b          id    lastMention
0   False      0     False
1   True       0     True
2   False      0     False
3   False      1     False
4   True       1     False
5   True       2     True
6   True       3     True
7   False      3     False
8   True       4     False
9   True       4     True
10  False      4     False
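To make the example reproducible, the input frame from the first table could be built as in the sketch below. The values are copied from that table; the default integer index is an assumption, based on how the table is printed.

```python
import pandas as pd

# Input data copied from the first table in the question.
# Assumption: the frame uses the default RangeIndex (0..10).
df = pd.DataFrame({
    'b': [False, True, False, False, True, True, True, False, True, True, False],
    'id': [0, 0, 0, 1, 1, 2, 2, 3, 4, 4, 4],
})

print(df)
```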

I have code that achieves this, although inefficiently:

def lastMentionFun(df):
    b = df['b']
    a = b.sum()
    if a > 0:
        maxInd = b[b].index.max()
        df.loc[maxInd, 'lastMention'] = True
    return df

df['lastMention'] = False
df = df.groupby('id').apply(lastMentionFun)

Can someone propose a correct, Pythonic approach that does this cleanly and quickly?

You can first filter the rows where column b is True, then get the max index value per id with groupby and the max aggregation:

print (df[df.b].reset_index().groupby('id')['index'].max())
id
0    1
1    4
2    6
4    9
Name: index, dtype: int64

Then initialise lastMention to False and set it to True at those index values with loc:

df['lastMention'] = False
df.loc[df[df.b].reset_index().groupby('id')['index'].max(), 'lastMention'] = True

print (df)
        b  id  lastMention
0   False   0        False
1    True   0         True
2   False   0        False
3   False   1        False
4    True   1         True
5    True   2        False
6    True   2         True
7   False   3        False
8    True   4        False
9    True   4         True
10  False   4        False

Another solution: get the max index values with groupby and apply, then test membership of the index values with isin. The output is a boolean array:

print (df[df.b].groupby('id').apply(lambda x: x.index.max()))
id
0    1
1    4
2    6
4    9
dtype: int64

df['lastMention'] = df.index.isin(df[df.b].groupby('id').apply(lambda x: x.index.max()))
print (df)
        b  id lastMention
0   False   0       False
1    True   0        True
2   False   0       False
3   False   1       False
4    True   1        True
5    True   2       False
6    True   2        True
7   False   3       False
8    True   4       False
9    True   4        True
10  False   4       False
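The lambda passed to apply runs once per group in Python. A possible variation, offered as a sketch and not as part of the original answer, groups the index labels directly so that the built-in max aggregation can be used instead. The names mask and last_idx are illustrative, and the frame is rebuilt from the question's first table on the assumption that it has a default integer index.

```python
import pandas as pd

# Input data copied from the question's first table (default RangeIndex assumed).
df = pd.DataFrame({
    'b': [False, True, False, False, True, True, True, False, True, True, False],
    'id': [0, 0, 0, 1, 1, 2, 2, 3, 4, 4, 4],
})

mask = df['b']

# Take the index labels of the rows where b is True, group them by the
# corresponding id values, and keep the largest label in each group.
last_idx = df.index.to_series()[mask].groupby(df.loc[mask, 'id']).max()

# Mark the rows whose index label is one of those per-id maxima.
df['lastMention'] = df.index.isin(last_idx)

print(df)
```

Like the answer above, this relies on the index increasing with row order, so that the largest label within an id is also the last occurrence.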

Not sure if this is the most efficient method, but it uses only built-in functions: mainly cumsum, followed by max to check whether a row's running count equals the final count for its id. pd.merge is only used to put the max back in the table; maybe there's a better way of doing that?

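import numpy as np
import pandas as pd
# Running count of True values in b within each id (only column b is summed)
df['cum_b'] = df['b'].astype(int).groupby(df['id']).cumsum()
df = pd.merge(df, df[['id','cum_b']].groupby('id', as_index=False).max(), how='left', on='id', suffixes=('','_max'))
df['lastMention'] = np.logical_and(df.b, df.cum_b == df.cum_b_max)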
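On the question of whether the merge can be avoided: one option, sketched here under the assumption that the data matches the question's first table, is groupby(...).transform('max'), which broadcasts each group's maximum back onto the original rows. The helper names cum_b and cum_b_max are kept as standalone Series for illustration, not added as columns.

```python
import pandas as pd

# Input data copied from the question's first table.
df = pd.DataFrame({
    'b': [False, True, False, False, True, True, True, False, True, True, False],
    'id': [0, 0, 0, 1, 1, 2, 2, 3, 4, 4, 4],
})

# Running count of True values in b within each id.
cum_b = df['b'].astype(int).groupby(df['id']).cumsum()

# Per-id maximum of that running count, aligned row by row with df,
# which takes the place of the merge step.
cum_b_max = cum_b.groupby(df['id']).transform('max')

# A row is the last mention if b is True and its running count
# has reached the final count for its id.
df['lastMention'] = df['b'] & (cum_b == cum_b_max)

print(df)
```

This version does not depend on the index labels, only on the row order within each id.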

PS: The DataFrame in your example changes slightly between the first and the second snippet; I hope I've interpreted your request correctly!
