Efficient pandas groupby based on another Series

Question

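I need to perform a groupby operation that depends on another boolean column of my DataFrame. It is easiest to see with an example. I have the following DataFrame: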

    b          id   
0   False      0
1   True       0
2   False      0
3   False      1
4   True       1
5   True       2
6   True       2
7   False      3
8   True       4
9   True       4
10  False      4

I want to obtain a column whose element is True if column b is True and it is the last True for the given id:

    b          id    lastMention
0   False      0     False
1   True       0     True
2   False      0     False
3   False      1     False
4   True       1     False
5   True       2     True
6   True       3     True
7   False      3     False
8   True       4     False
9   True       4     True
10  False      4     False

I have code that achieves this, but it is inefficient:

def lastMentionFun(df):
    b = df['b']
    a = b.sum()
    if a > 0:
        maxInd = b[b].index.max()
        df.loc[maxInd, 'lastMention'] = True
    return df

df['lastMention'] = False
df = df.groupby('id').apply(lastMentionFun)

Can anyone suggest a proper Pythonic way to do this that is both clean and fast?

Answer 1

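You can first filter for the rows where column b is True, then use groupby with the max aggregation to get the maximum index value for each id: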

print (df[df.b].reset_index().groupby('id')['index'].max())
id
0    1
1    4
2    6
4    9
Name: index, dtype: int64

Then use loc to set lastMention to True at those index values:

df['lastMention'] = False
df.loc[df[df.b].reset_index().groupby('id')['index'].max(), 'lastMention'] = True

print (df)
        b  id  lastMention
0   False   0        False
1    True   0         True
2   False   0        False
3   False   1        False
4    True   1         True
5    True   2        False
6    True   2         True
7   False   3        False
8    True   4        False
9    True   4         True
10  False   4        False
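
For reference, here is a self-contained sketch of the same steps, with the sample DataFrame rebuilt from the question's first table. The name last_true is only illustrative, and passing .to_numpy() to loc is a small departure from the snippet above, made so that the values are unambiguously treated as index labels.

```python
import pandas as pd

# Sample data copied from the first table in the question.
df = pd.DataFrame({
    'b': [False, True, False, False, True, True, True, False, True, True, False],
    'id': [0, 0, 0, 1, 1, 2, 2, 3, 4, 4, 4],
})

# Keep only rows where b is True, expose the index as a column named 'index',
# and take the largest index label per id. Groups with no True row drop out.
last_true = df[df.b].reset_index().groupby('id')['index'].max()

# Start from all False, then flag the selected index labels.
df['lastMention'] = False
df.loc[last_true.to_numpy(), 'lastMention'] = True

print(df)
```

This assumes the index is unnamed (so reset_index produces a column called 'index') and that a larger index label means a later row.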

Another solution: use groupby with apply to get the max index value for each id, then use isin to test which index values are among them. The output is a boolean array:

print (df[df.b].groupby('id').apply(lambda x: x.index.max()))
id
0    1
1    4
2    6
4    9
dtype: int64

df['lastMention'] = df.index.isin(df[df.b].groupby('id').apply(lambda x: x.index.max()))
print (df)
        b  id lastMention
0   False   0       False
1    True   0        True
2   False   0       False
3   False   1       False
4    True   1        True
5    True   2       False
6    True   2        True
7   False   3       False
8    True   4       False
9    True   4        True
10  False   4       False
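
The isin variant can also be run end to end as below. This is a sketch under the same sample data; selecting the 'b' column before apply is my own adjustment (not part of the snippet above) so that the lambda only receives a Series, and last_idx is an illustrative name.

```python
import pandas as pd

# Sample data copied from the first table in the question.
df = pd.DataFrame({
    'b': [False, True, False, False, True, True, True, False, True, True, False],
    'id': [0, 0, 0, 1, 1, 2, 2, 3, 4, 4, 4],
})

# For each id, the largest index label among rows where b is True.
last_idx = df[df.b].groupby('id')['b'].apply(lambda x: x.index.max())

# Index.isin returns a boolean array aligned with df's rows,
# so it can be assigned directly as the new column.
df['lastMention'] = df.index.isin(last_idx)

print(df)
```

Compared with the loc version, this builds the column in one assignment and does not need the initial fill with False.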

Answer 2

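I'm not sure this is the most efficient approach, but it uses only built-in functions: mainly cumsum, then max to check whether a row's running count equals the final count for its group. pd.merge is only used to put the max values back into the table; perhaps there is a better way to do that?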

import numpy as np
import pandas as pd
# running count of True values in b within each id group
df['cum_b'] = df.groupby('id')['b'].cumsum()
df = pd.merge(df, df[['id','cum_b']].groupby('id', as_index=False).max(), how='left', on='id', suffixes=('','_max'))
df['lastMention'] = np.logical_and(df.b, df.cum_b == df.cum_b_max)

PS: the DataFrame in your example changes slightly between the first snippet and the second; I hope I have interpreted your request correctly!
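
On the question of a better way to put the max back: one possibility, offered only as a sketch, is groupby(...).transform('max'), which broadcasts each group's maximum onto its rows so the merge is not needed. The sample data is rebuilt from the question's first table, and the intermediate names cum_b and cum_b_max are kept as plain variables here rather than columns.

```python
import pandas as pd

# Sample data copied from the first table in the question.
df = pd.DataFrame({
    'b': [False, True, False, False, True, True, True, False, True, True, False],
    'id': [0, 0, 0, 1, 1, 2, 2, 3, 4, 4, 4],
})

# Running count of True values within each id group.
cum_b = df['b'].astype(int).groupby(df['id']).cumsum()

# Each group's final count, repeated on every row of that group.
# This takes the place of the groupby().max() + pd.merge step.
cum_b_max = cum_b.groupby(df['id']).transform('max')

# A row is the last mention if b is True and its running count
# has already reached the group's final count.
df['lastMention'] = df['b'] & (cum_b == cum_b_max)

print(df)
```

Unlike the index-based approaches, this relies on row order within each group rather than on index labels, so it does not require a sorted or unique index.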
