
Vectorized function with counter on pandas dataframe column

Consider this pandas dataframe, where the condition column is 1 whenever value is below 5 (the threshold itself is arbitrary).

import pandas as pd
d = {'value': [30,100,4,0,80,0,1,4,70,70],'condition':[0,0,1,1,0,1,1,1,0,0]}
df = pd.DataFrame(data=d)
df

Out[1]:
   value  condition
0     30          0
1    100          0
2      4          1
3      0          1
4     80          0
5      0          1
6      1          1
7      4          1
8     70          0
9     70          0

What I want is for every run of consecutive values below 5 to share the same id, and for all values of 5 or above to get 0 (or NA, or a negative value; it doesn't matter, as long as they are all the same). I want to create a new column called new_id that contains these cumulative ids, as follows:

   value  condition  new_id
0     30          0       0
1    100          0       0
2      4          1       1
3      0          1       1
4     80          0       0
5      0          1       2
6      1          1       2
7      4          1       2
8     70          0       0
9     70          0       0

In a very inefficient for loop I would do this (which works):

counter = 1  # next id to hand out
for i in range(df.shape[0]):
    cond = df.loc[df.index[i], 'condition']
    # treat the row before the first one as 0, so that i-1 never wraps around to the last row
    prev = df.loc[df.index[i-1], 'condition'] if i > 0 else 0

    if cond == 1 and prev == 0:
        new_id = counter # start of a new run: assign new id
        counter += 1

    elif cond == 1:
        new_id = counter-1 # inside a run: assign current id

    else:
        new_id = 0 # condition is 0: assign 0

    df.loc[df.index[i],'new_id'] = new_id
df
  

But this is very inefficient and I have a very big dataset. I therefore tried different kinds of vectorization, but so far I have failed to keep the counter from counting up inside each "cluster" of consecutive points:

# First try using cumsum():
df['new_id'] = 0
df['new_id_temp'] = ((df['condition'] == 1)).astype(int).cumsum()
df.loc[(df['condition'] == 1), 'new_id'] = df['new_id_temp']
df[['value', 'condition', 'new_id']]

# Another try using list comprehension but this just does +1:
[row+1 for ind, row in enumerate(df['condition']) if (row != row-1)]

I also tried using apply() with a custom if/else function, but it seems this does not allow me to use a counter.

There are already a ton of similar posts about this, but none of them keeps the same id for consecutive rows.

Example posts are: Maintain count in python list comprehension; Pandas cumsum on a separate column condition; Python - keeping counter inside list comprehension; python pandas conditional cumulative sum; Conditional count of cumulative sum Dataframe - Loop through columns.

Welcome to SO! Why not just rely on base Python for this?

def counter_func(l):
    new_id = [0]   # First value is zero in any case
    counter = 0
    for i in range(1, len(l)):
        if l[i] == 0:
            new_id.append(0)
        elif l[i] == 1 and l[i-1] == 0:
            counter += 1
            new_id.append(counter)
        elif l[i] == l[i-1] == 1:
            new_id.append(counter)
        else: new_id.append(None)
    return new_id
df["new_id"] = counter_func(df["condition"])

The result looks like this:

   value  condition  new_id
0     30          0       0
1    100          0       0
2      4          1       1
3      0          1       1
4     80          0       0
5      0          1       2
6      1          1       2
7      4          1       2
8     70          0       0
9     70          0       0

Edit:

You can also use numba, which sped up the function quite a lot for me: from about 1 s to about 60 ms.

You should pass numpy arrays into the function to use it, meaning you'll have to pass df["condition"].values.

from numba import njit
import numpy as np
@njit
def func(arr):
    res = np.empty(arr.shape[0])
    counter = 0
    res[0] = 0 # First value is zero anyway
    for i in range(1, arr.shape[0]):
        if arr[i] == 0:
            res[i] = 0
        elif arr[i] == 1 and arr[i-1] == 0:
            counter += 1
            res[i] = counter
        elif arr[i] == arr[i-1] == 1:
            res[i] = counter
        else: res[i] = np.nan
    return res
df["new_id"] = func(df["condition"].values)
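If numba is not an option, the same labelling can probably be done with plain NumPy and no explicit loop. This is only a sketch of that idea, not part of the answer above: run_ids is a made-up helper name, and the input is assumed to contain only 0s and 1s. Unlike the loop versions, it does not assume that the first value is 0.

```python
import numpy as np

def run_ids(arr):
    """Label each run of consecutive 1s with 1, 2, 3, ...; 0s stay 0."""
    arr = np.asarray(arr, dtype=np.int64)
    # A run starts where the value steps from 0 to 1; prepend=0 makes a
    # leading 1 count as the start of a run as well.
    starts = np.diff(arr, prepend=0) == 1
    # Count the run starts seen so far, then zero out the rows outside a run.
    return np.cumsum(starts) * arr

condition = np.array([0, 0, 1, 1, 0, 1, 1, 1, 0, 0])
ids = run_ids(condition)
print(ids)
```

With the dataframe from the question this would be called as run_ids(df["condition"].values). Note that the prepend argument of np.diff needs NumPy 1.16 or newer.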

You can use cumsum(), as you did in your first try; just modify it a bit:

# calculate delta (fill_value=0 avoids a NaN in the first row)
df['delta'] = df['condition']-df['condition'].shift(1, fill_value=0)
# get rid of -1 for the cumsum (replace it by 0)
df['delta'] = df['delta'].replace(-1,0)

# cumulative sum conditional: multiply with condition column
df['cumsum_x'] = df['delta'].cumsum()*df['condition']
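For what it's worth, the same idea can be collapsed into a few lines without the helper columns, starting directly from value. This is a sketch under assumptions rather than a tested drop-in: threshold is a name introduced here for the cut-off of 5 from the question, and the result is written straight to new_id.

```python
import pandas as pd

df = pd.DataFrame({'value': [30, 100, 4, 0, 80, 0, 1, 4, 70, 70]})
threshold = 5  # assumed name for the cut-off used in the question

# 1 where value is below the threshold, 0 otherwise
cond = (df['value'] < threshold).astype(int)
# a run starts where cond steps from 0 to 1 (the row before the first counts as 0)
starts = cond.sub(cond.shift(1, fill_value=0)).eq(1)
# number the runs, then set every row outside a run to 0
df['new_id'] = starts.cumsum().where(cond.eq(1), 0)
print(df)
```

The where() call plays the same role as multiplying by the condition column: it keeps the running count inside a run and resets everything else to 0.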
