Pandas Groupby: get a value from a previous element of the group based on another column's value

Question

I have a DataFrame with 4 columns. I have already sorted it by "group" and "timestamp".

import pandas as pd

df = pd.DataFrame(
{
    "type": ['type0', 'type1', 'type2', 'type3', 'type1', 'type3', 'type0', 'type1', 'type3', 'type3'],
    "group": [1, 1, 1, 1, 1, 1, 2, 2, 2, 2],
    "timestamp": ["20220105 07:52:46", "20220105 07:53:11", "20220105 07:53:55", "20220105 07:59:12", "20220105 08:24:13", "20220105 08:48:19", "20220105 11:01:30", "20220105 11:15:16", "20220105 12:13:36", "20220105 12:19:44"],
    "price": [0, 1.5, 2.5, 3, 3.2, 3.1, 0.5, 3, 3.25, pd.NA]
})

>> df
    type  group          timestamp price
0  type0      1  20220105 07:52:46     0
1  type1      1  20220105 07:53:11   1.5
2  type2      1  20220105 07:53:55   2.5
3  type3      1  20220105 07:59:12     3
4  type1      1  20220105 08:24:13   3.2
5  type3      1  20220105 08:48:19   3.1
6  type0      2  20220105 11:01:30   0.5
7  type1      2  20220105 11:15:16     3
8  type3      2  20220105 12:13:36  3.25
9  type3      2  20220105 12:19:44  <NA>

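After grouping by the "group" column, I want to create a "new_price" column using the following logic:
For every 'type3' row in a group (i.e. df['type'] == 'type3'), take the price from the PREVIOUS 'type1' or 'type2' row in the same group. For type0/type1/type2 rows, keep the same price as in the input DataFrame.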

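My solution:

My solution below works as long as there are no 2 consecutive 'type3' rows. When there are 2 consecutive 'type3' rows, the second 'type3' row gets the wrong price. I want the price of the previous 'type1' or 'type2' row in the group, but my solution takes the price from the first 'type3' row instead.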

df = df.sort_values(by=["group", "timestamp"])
required_types_mask = df['type'].isin(['type1', 'type2', 'type3'])
temp_series = df.loc[:, 'price'].where(required_types_mask).groupby(df['group']).shift(1)
type_3_mask = df['type'].eq('type3')
df.loc[:, 'new_price'] = df.loc[:, 'price'].mask(type_3_mask, temp_series)

My result:

    type  group          timestamp price new_price
0  type0      1  20220105 07:52:46     0         0
1  type1      1  20220105 07:53:11   1.5       1.5
2  type2      1  20220105 07:53:55   2.5       2.5
3  type3      1  20220105 07:59:12     3       2.5
4  type1      1  20220105 08:24:13   3.2       3.2
5  type3      1  20220105 08:48:19   3.1       3.2
6  type0      2  20220105 11:01:30   0.5       0.5
7  type1      2  20220105 11:15:16     3         3
8  type3      2  20220105 12:13:36  3.25         3
9  type3      2  20220105 12:19:44  <NA>       3.25 <- Incorrect price
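
To see where the wrong value comes from, here is a minimal sketch that isolates group 2 of the sample data. It assumes NaN in place of <NA>, and the names g2 and prev_row_price are only illustrative. shift(1) always looks exactly one row back, so the second 'type3' row receives the price of the first 'type3' row.

```python
import pandas as pd

# Group 2 of the sample data, with NaN standing in for <NA> (an assumption).
g2 = pd.DataFrame({
    "type":  ["type0", "type1", "type3", "type3"],
    "price": [0.5, 3.0, 3.25, float("nan")],
})

# shift(1) moves every price down by one row, regardless of the row's type.
prev_row_price = g2["price"].shift(1)

# The last row is 'type3' and its predecessor is also 'type3',
# so it picks up 3.25 rather than the 'type1' price of 3.0.
print(prev_row_price.tolist())
```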

Expected result:

    type  group          timestamp price new_price
0  type0      1  20220105 07:52:46     0         0
1  type1      1  20220105 07:53:11   1.5       1.5
2  type2      1  20220105 07:53:55   2.5       2.5
3  type3      1  20220105 07:59:12     3       2.5
4  type1      1  20220105 08:24:13   3.2       3.2
5  type3      1  20220105 08:48:19   3.1       3.2
6  type0      2  20220105 11:01:30   0.5       0.5
7  type1      2  20220105 11:15:16     3         3
8  type3      2  20220105 12:13:36  3.25         3
9  type3      2  20220105 12:19:44  <NA>         3 <- Correct price

Answer 1

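We can mask the price on the 'type3' rows (and on the 'type0' rows, so they are never used as a fill source) and then ffill within each group: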

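import numpy as np

# Blank out type0/type3 prices so only type1/type2 prices can be carried forward
s = df.price.mask(df.type.isin(['type0','type3']))
# type3 rows take the forward-filled price; every other row keeps its own price
df['new'] = np.where(df.type.eq('type3'),s.groupby(df['group']).ffill(),df['price'])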
df
    type  group          timestamp price  new
0  type0      1  20220105 07:52:46     0    0
1  type1      1  20220105 07:53:11   1.5  1.5
2  type2      1  20220105 07:53:55   2.5  2.5
3  type3      1  20220105 07:59:12     3  2.5
4  type1      1  20220105 08:24:13   3.2  3.2
5  type3      1  20220105 08:48:19   3.1  3.2
6  type0      2  20220105 11:01:30   0.5  0.5
7  type1      2  20220105 11:15:16     3    3
8  type3      2  20220105 12:13:36  3.25    3
9  type3      2  20220105 12:19:44  <NA>    3
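
For reference, the same idea can be written as a self-contained sketch. The helper name fill_type3_price and its parameters are hypothetical, the timestamp column is omitted because the rows are assumed to be sorted already, and NaN replaces <NA> so that the price column stays float64. It uses Series.mask instead of np.where so the result stays aligned on the index.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "type": ["type0", "type1", "type2", "type3", "type1",
             "type3", "type0", "type1", "type3", "type3"],
    "group": [1, 1, 1, 1, 1, 1, 2, 2, 2, 2],
    # NaN instead of pd.NA: an assumption made to keep a float dtype.
    "price": [0, 1.5, 2.5, 3, 3.2, 3.1, 0.5, 3, 3.25, np.nan],
})


def fill_type3_price(df, source_types=("type1", "type2"), target_type="type3"):
    """Hypothetical helper: target rows take the last source price in their group."""
    is_target = df["type"].eq(target_type)
    # Keep only the prices of source rows; all other rows become NaN.
    source_price = df["price"].where(df["type"].isin(source_types))
    # Carry the last source price forward within each group.
    carried = source_price.groupby(df["group"]).ffill()
    # Target rows take the carried price; other rows keep their own price.
    return df["price"].mask(is_target, carried)


df["new_price"] = fill_type3_price(df)
print(df)
```

A 'type3' row with no earlier 'type1' or 'type2' row in its group stays NaN, because there is nothing to carry forward.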

Answer 2

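You can use a chain of masks together with ffill.

First mask 'type3' and 'type0' (the latter so that it is not used as a source for ffill). Then restore the 'type0' values.

All of this is done per group.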

df['new_price'] = (
 df.groupby('group')
   .apply(lambda d: d['price']
            .mask(d['type'].isin(['type3', 'type0'])) # type0/3 to NaN
            .ffill()                                  # fill with previous type1/2
            .mask(d['type'].eq('type0'), d['price'])  # restore type0
         )
   .values
 )

Output:

    type  group          timestamp price new_price
0  type0      1  20220105 07:52:46     0         0
1  type1      1  20220105 07:53:11   1.5       1.5
2  type2      1  20220105 07:53:55   2.5       2.5
3  type3      1  20220105 07:59:12     3       2.5
4  type1      1  20220105 08:24:13   3.2       3.2
5  type3      1  20220105 08:48:19   3.1       3.2
6  type0      2  20220105 11:01:30   0.5       0.5
7  type1      2  20220105 11:15:16     3       3.0
8  type3      2  20220105 12:13:36  3.25       3.0
9  type3      2  20220105 12:19:44  <NA>       3.0
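
Assigning through .values is positional, so it relies on the frame already being sorted by group, as it is in the question. If that cannot be guaranteed, an index-aligned variant is safer. The sketch below uses a hypothetical frame with interleaved groups; groupby(...).ffill() returns a Series on the original index, so each value lands on its own row.

```python
import pandas as pd

# Hypothetical frame whose groups are interleaved (not sorted by group).
df = pd.DataFrame({
    "type":  ["type1", "type1", "type3", "type3"],
    "group": [1, 2, 1, 2],
    "price": [1.0, 2.0, 9.0, 9.0],
})

# Hide type0/type3 prices so they are never used as a fill source.
masked = df["price"].mask(df["type"].isin(["type0", "type3"]))
# Forward-fill within each group; the result keeps df's original index.
filled = masked.groupby(df["group"]).ffill()
# Only type3 rows take the filled value; all other rows keep their price.
df["new_price"] = df["price"].where(df["type"].ne("type3"), filled)
print(df)
```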
