Vectorized function with a counter on a pandas DataFrame column

Question

Consider this pandas DataFrame, where the condition column is 1 whenever value is below 5 (the threshold is arbitrary).

import pandas as pd
d = {'value': [30,100,4,0,80,0,1,4,70,70],'condition':[0,0,1,1,0,1,1,1,0,0]}
df = pd.DataFrame(data=d)
df

Out[1]:
   value  condition
0     30          0
1    100          0
2      4          1
3      0          1
4     80          0
5      0          1
6      1          1
7      4          1
8     70          0
9     70          0

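What I want is for every run of consecutive values below 5 to share the same id, and for all values above 5 to get 0 (or NA, or a negative value, it doesn't matter, as long as they are all the same). I would like to create a new column called new_id that holds these cumulative ids, like this: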

   value  condition  new_id
0     30          0       0
1    100          0       0
2      4          1       1
3      0          1       1
4     80          0       0
5      0          1       2
6      1          1       2
7      4          1       2
8     70          0       0
9     70          0       0

In a very inefficient for loop I would do this (it works):

counter = 1  # next id to hand out
for i in range(df.shape[0]):
    cond = df.loc[df.index[i], 'condition']
    # treat the row before the first one as 0, so i-1 never wraps around to the last row
    prev = df.loc[df.index[i-1], 'condition'] if i > 0 else 0

    if cond == 1 and prev == 0:
        new_id = counter  # start of a new run: assign a new id
        counter += 1
    elif cond == 1:
        new_id = counter - 1  # still inside the same run: reuse the current id
    else:
        new_id = 0  # condition is 0: assign 0

    df.loc[df.index[i], 'new_id'] = new_id
df

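But this is very inefficient, and I have a very large dataset. So I tried different kinds of vectorization, but so far I have not managed to stop it from counting within each "cluster" of consecutive points: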

# First try using cumsum():
df['new_id'] = 0
df['new_id_temp'] = ((df['condition'] == 1)).astype(int).cumsum()
df.loc[(df['condition'] == 1), 'new_id'] = df['new_id_temp']
df[['value', 'condition', 'new_id']]

# Another try using list comprehension but this just does +1:
[row+1 for ind, row in enumerate(df['condition']) if (row != row-1)]
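To make the problem with the first attempt concrete, here is a self-contained sketch that reruns it on the sample data. The cumulative sum keeps increasing inside each run of 1s, so rows in the same run end up with different ids:

```python
import pandas as pd

d = {'value': [30, 100, 4, 0, 80, 0, 1, 4, 70, 70],
     'condition': [0, 0, 1, 1, 0, 1, 1, 1, 0, 0]}
df = pd.DataFrame(data=d)

# Same steps as the first attempt: the running count of 1s is copied
# into new_id wherever condition is 1.
df['new_id'] = 0
df['new_id_temp'] = (df['condition'] == 1).astype(int).cumsum()
df.loc[df['condition'] == 1, 'new_id'] = df['new_id_temp']

print(df['new_id'].tolist())
# [0, 0, 1, 2, 0, 3, 4, 5, 0, 0]   (wanted: [0, 0, 1, 1, 0, 2, 2, 2, 0, 0])
```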

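I also tried using apply() with a custom if/else function, but it seems this does not let me use a counter.

There are already plenty of similar posts about this, but none of them keep the same id for consecutive rows.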

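Example posts are: "Maintain count in python list comprehension"; "Pandas cumsum on a separate column condition"; "Python - keeping counter inside list comprehension"; "python pandas conditional cumulative sum"; "Conditional count of cumulative sum Dataframe - Loop through columns".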

Answer 1

Welcome to SO! Why not just rely on plain Python?

def counter_func(l):
    new_id = [0]   # First value is set to 0 (assumes the first condition is 0)
    counter = 0
    for i in range(1, len(l)):
        if l[i] == 0:
            new_id.append(0)
        elif l[i] == 1 and l[i-1] == 0:
            counter += 1
            new_id.append(counter)
        elif l[i] == l[i-1] == 1:
            new_id.append(counter)
        else: new_id.append(None)
    return new_id

# .tolist() makes l[i] positional, so this also works when the index is not 0..n-1
df["new_id"] = counter_func(df["condition"].tolist())

The result looks like this:

   value  condition  new_id
0     30          0       0
1    100          0       0
2      4          1       1
3      0          1       1
4     80          0       0
5      0          1       2
6      1          1       2
7      4          1       2
8     70          0       0
9     70          0       0
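Note that the function above hard-codes 0 for the first element, so a run of 1s that starts on the very first row would not get its own id. If that case can occur in your data, a possible variant is sketched below. The name counter_func_any_start is hypothetical, and the sketch assumes the input contains only 0 and 1 (anything other than 1 is mapped to 0, so the None branch is dropped):

```python
def counter_func_any_start(values):
    """Give each run of consecutive 1s its own id (1, 2, ...); everything else gets 0."""
    new_id = []
    counter = 0
    prev = 0  # pretend a 0 precedes the first element
    for v in values:
        if v == 1 and prev != 1:
            counter += 1  # a new run of 1s starts here
        new_id.append(counter if v == 1 else 0)
        prev = v
    return new_id


print(counter_func_any_start([0, 0, 1, 1, 0, 1, 1, 1, 0, 0]))
# [0, 0, 1, 1, 0, 2, 2, 2, 0, 0]
print(counter_func_any_start([1, 1, 0, 1]))
# [1, 1, 0, 2]
```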

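Edit:

You can also use numba, which sped the function up considerably for me: from roughly 1 second to 60 ms.

To use it you have to pass a numpy array to the function, which means passing df["condition"].values.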

from numba import njit
import numpy as np
@njit
def func(arr):
    res = np.empty(arr.shape[0])
    counter = 0
    res[0] = 0 # First value is zero anyway
    for i in range(1, arr.shape[0]):
        if arr[i] == 0:
            res[i] = 0
        elif arr[i] == 1 and arr[i-1] == 0:
            counter += 1
            res[i] = counter
        elif arr[i] == arr[i-1] == 1:
            res[i] = counter
        else: res[i] = np.nan
    return res

df["new_id"] = func(df["condition"].values)
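One thing to be aware of: np.empty defaults to float64, so the ids come back as floats (1.0, 2.0, ...). If you would rather have integers, one option is pandas' nullable Int64 dtype, which also keeps any NaN as a missing value. A minimal sketch, where the array res is only a hand-written stand-in for the output of func on the sample data:

```python
import numpy as np
import pandas as pd

# Stand-in for the float64 array that func() returns for the sample data.
res = np.array([0., 0., 1., 1., 0., 2., 2., 2., 0., 0.])

# Nullable integer dtype: whole-number floats become ints, NaN becomes <NA>.
ids = pd.Series(res).astype("Int64")
print(ids.tolist())
# [0, 0, 1, 1, 0, 2, 2, 2, 0, 0]

# A NaN (the fallback branch of func) survives as a missing value.
with_gap = pd.Series(np.array([0., 1., np.nan])).astype("Int64")
print(with_gap.isna().tolist())
# [False, False, True]
```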

Answer 2

You can use cumsum(), as you did in your first attempt; it just needs a small modification:

# calculate delta
# fill_value=0 treats the row before the first one as 0, so delta contains no NaN
df['delta'] = df['condition']-df['condition'].shift(1, fill_value=0)
# get rid of -1 for the cumsum (replace it by 0)
df['delta'] = df['delta'].replace(-1,0)

# cumulative sum conditional: multiply with condition column
df['cumsum_x'] = df['delta'].cumsum()*df['condition']
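For reference, here are the same steps as a self-contained sketch on the question's data. The helper name run_ids is hypothetical. It uses shift(1, fill_value=0) so that the first row does not produce a NaN, and clip(lower=0) in place of replace(-1, 0), which has the same effect when the column holds only 0 and 1:

```python
import pandas as pd

def run_ids(condition: pd.Series) -> pd.Series:
    """Number each run of consecutive 1s (1, 2, ...); rows with 0 get 0."""
    # +1 where a run starts, -1 where it ends, 0 elsewhere.
    delta = condition - condition.shift(1, fill_value=0)
    # Keep only the run starts.
    starts = delta.clip(lower=0)
    # Running count of runs so far, zeroed out where condition is 0.
    return starts.cumsum() * condition


d = {'value': [30, 100, 4, 0, 80, 0, 1, 4, 70, 70],
     'condition': [0, 0, 1, 1, 0, 1, 1, 1, 0, 0]}
df = pd.DataFrame(data=d)
df['new_id'] = run_ids(df['condition'])

print(df['new_id'].tolist())
# [0, 0, 1, 1, 0, 2, 2, 2, 0, 0]
```

Because everything is expressed as whole-column operations, there is no Python-level loop over the rows.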

Answer 1
0 2020-11-20 12:49:08

Answer 2
0 accepted 2020-11-23 09:24:08
