pandas groupby: apply a condition on the first occurrence of a column value

Question

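I have a DataFrame like the one shown below, where pid and event_date are the index left over from an earlier groupby. I now want to group by pid again and apply two conditions:

The person (pid = person) has two or more True labels;
The person's first True instance occurred when he/she was under 45.

If both conditions are met, that person/pid should be assigned True in the grouped DataFrame.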

                           age      label
  pid       event_date      
00000001    2000-08-28  76.334247   False
            2000-10-17  76.471233   False
            2000-10-31  76.509589   True
            2000-11-02  76.512329   True
... ... ... ...
00000005    2014-08-15  42.769863   False
            2015-04-04  43.476712   False
            2015-11-06  44.057534   True
            2017-03-06  45.386301   True

So far I have only managed to implement the first condition:

df = (df.groupby(['pid']).apply(lambda x: sum(x['label'])>1).to_frame('label'))

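The second one is tricky for me. How can I condition on the first occurrence of a particular column value? Any suggestions are very welcome. Many thanks!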

Updated example DataFrame:

import pandas as pd

a = pd.DataFrame(columns=['pid', 'event_date', 'age', 'label'])
a['pid'] = [1, 1, 1, 1, 5, 5, 5, 5]
a['event_date'] = ['2000-08-28', '2000-08-28', '2000-08-28', '2000-08-28',
                   '2000-08-28', '2000-08-28', '2000-08-28', '2000-08-28']
a['event_date'] = pd.to_datetime(a.event_date)
a['age'] = [76.334247, 76.471233, 76.509589, 76.512329, 42.769863, 43.476712, 44.057534, 45.386301]
a['label'] = [False, False, True, True, False, False, True, True]

a = (a.groupby(['pid', 'event_date', 'age']).apply(lambda x: x['label'].any()).to_frame('label'))
a.reset_index(level=['age'], inplace=True)

Now, if I apply (a.groupby(['pid']).apply(lambda x: sum(x['label'])>1).to_frame('label')) I get

    label
pid 
1   True
5   True

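Only the first condition is applied here (because I skipped the second one). Adding the second condition should leave only pid=5 labelled True, because that is the only person/pid who was under 45 when the first label=True occurred.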

Answer 1

After a (fun) half hour, I came up with this:

condition = a.reset_index().groupby('pid')['label'].sum().ge(2) & a.reset_index().groupby('pid').apply(lambda x: x['age'][x['label'].idxmax()] < 45)

Output:

>>> condition
pid
1    False
5     True
dtype: bool

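If the index were a regular one rather than a MultiIndex of pid + event_date, this could be a bit shorter (dropping the two .reset_index() calls). If you can't avoid the MultiIndex from the start and you don't mind modifying a: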

a = a.reset_index()
condition = a.groupby('pid')['label'].sum().ge(2) & a.groupby('pid').apply(lambda x: x['age'][x['label'].idxmax()] < 45)
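As an alternative to `apply`, both conditions can also be derived from the True rows alone. The following is a minimal sketch, not part of the original answer. It rebuilds the question's example data without event_date, assumes the rows of each pid are already in chronological order, and uses the illustrative names `true_rows`, `stats`, `n_true`, and `first_true_age`.

```python
import pandas as pd

# Example data mirroring the question (flat index, event_date omitted).
a = pd.DataFrame({
    "pid": [1, 1, 1, 1, 5, 5, 5, 5],
    "age": [76.334247, 76.471233, 76.509589, 76.512329,
            42.769863, 43.476712, 44.057534, 45.386301],
    "label": [False, False, True, True, False, False, True, True],
})

# Keep only the rows where label is True.
true_rows = a[a["label"]]

# Per pid: how many True rows there are, and the age at the first of them.
# "first" means first in row order, so rows are assumed sorted by date.
stats = true_rows.groupby("pid")["age"].agg(n_true="count", first_true_age="first")

# Condition 1: two or more True labels. Condition 2: first True before age 45.
condition = stats["n_true"].ge(2) & stats["first_true_age"].lt(45)

# A pid with no True rows is missing from stats; add it back as False.
condition = condition.reindex(a["pid"].unique(), fill_value=False).rename_axis("pid")

print(condition)
```

Filtering first means a pid without any True label never reaches the age check, at the cost of the extra `reindex` step to restore such pids.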

The groupby/apply expression for condition, expanded with a comment on each step:

condition = (
    a.groupby('pid') # Group by pid
    ['label']        # Get the label column for each group
    .sum()           # Compute the sum of the True values
    .ge(2)           # Are there two or more?
    
    # Element-wise AND of the two conditions. Each side returns a boolean Series
    # indexed by unique pid, saying whether that pid meets the condition.
    &
    
    a.groupby('pid')                # Group by pid
    .apply(                         # Call a function for each group, passing the group (a dataframe) to the function as its first parameter
        lambda x:                   # Function start
            x['age'][               # Get item from the age column at the specified index
                x['label'].idxmax() # Get the index of the highest value of the label column (since they're only boolean values, the highest will be the first True value)
            ] < 45                  # Check if it's less than 45
    )
)
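The `idxmax()` step relies on True comparing greater than False, so the maximum of a boolean column sits at its first True. The small sketch below illustrates this on two made-up Series (`has_true` and `no_true` are illustrative names, not from the answer). It also shows that on an all-False column `idxmax()` still returns the first index label, which is why the `sum().ge(2)` condition is needed to rule such groups out.

```python
import pandas as pd

# A boolean column whose first True sits at index label 12.
has_true = pd.Series([False, False, True, True], index=[10, 11, 12, 13])

# A boolean column with no True at all.
no_true = pd.Series([False, False, False], index=[20, 21, 22])

# idxmax returns the label of the first maximum, i.e. the first True.
first_true_label = has_true.idxmax()

# With no True present, every value ties at False, so the first label comes back.
fallback_label = no_true.idxmax()

print(first_true_label, fallback_label)
```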
