How to eliminate a for loop in a Pandas DataFrame when filling each row of a column based on multiple if/elif statements

I am trying to get rid of a for loop to speed up filling the values of column 'C' based on if/elif conditions that involve multiple columns and rows. I have not been able to find a proper solution.

I tried applying np.where with conditions, choices and default values, but failed to get the expected results because I was unable to extract individual values from the pandas Series object.

import pandas as pd

df = pd.DataFrame()
df['A']=['Yes','Yes','No','No','Yes','No','Yes','Yes','Yes','Yes']
df['B']=[1,1,0,1,1,0,1,0,0,1]
df['C']=None
df['D']=['xyz','Yes','No','xyz','Yes','No','xyz','Yes','Yes','Yes']
df.loc[0, 'C'] = 'xyz'  # single-step assignment instead of chained df['C'][0] = ...
for i in range(0,len(df)-1):
    if (df.iloc[1+i, 1]==1) & (df.iloc[i, 2]=="xyz") & (df.iloc[1+i, 0]=="No"):
        df.iloc[1+i, 2] = "Minus"
    elif (df.iloc[1+i, 1]==1) & (df.iloc[i, 2]=="xyz") & (df.iloc[1+i, 0]=="Yes"):
        df.iloc[1+i, 2] = "Plus"
    elif (df.iloc[i, 3]!="xyz") or ((df.iloc[1+i, 1]==0) & (df.iloc[i, 2]=="xyz")):
        df.iloc[1+i, 2] = "xyz"
    elif (df.iloc[1+i, 0]=="Yes") & (df.iloc[i, 2]=="xyz"):
        df.iloc[1+i, 2] = "Plus"
    elif (df.iloc[1+i, 0]=="No") & (df.iloc[i, 2]=="xyz"):
        df.iloc[1+i, 2] = "Minus"
    else:
        df.iloc[1+i, 2] = df.iloc[i, 2]
df


I would appreciate help from the community in rewriting the above code so that it runs faster, preferably with NumPy vectorization.

The loop cannot be efficiently vectorized with NumPy or Pandas because there is a loop-carried data dependency on df['C']: each value depends on the one computed for the previous row. The loop is very slow because of Pandas direct indexing and string comparisons. Fortunately, you can use Numba to solve this problem efficiently. You first need to convert the columns into strongly typed NumPy arrays so that Numba can be useful. Note that Numba is pretty slow at dealing with strings, so it is better to perform the string checks up front as vectorized NumPy comparisons.

Here is the resulting Numba code:
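To see the dependency concretely, here is a minimal sketch that runs the same branching logic as a plain Python loop over NumPy arrays and compares it with a naive np.select attempt. The function names sequential and naive_select are hypothetical, and naive_select deliberately makes the (wrong) assumption that the previous value of C is always 'xyz', which is the only way to express the conditions without the previous result.

```python
import numpy as np

# Sample columns from the question, reduced to boolean/integer arrays.
a = np.array(['Yes', 'Yes', 'No', 'No', 'Yes', 'No', 'Yes', 'Yes', 'Yes', 'Yes']) == 'Yes'
b = np.array([1, 1, 0, 1, 1, 0, 1, 0, 0, 1])
d = np.array(['xyz', 'Yes', 'No', 'xyz', 'Yes', 'No', 'xyz', 'Yes', 'Yes', 'Yes']) != 'xyz'

def sequential(a, b, d):
    """Plain-Python version of the loop: c[i+1] is computed from c[i]."""
    c = ['xyz']
    for i in range(len(a) - 1):
        prev_is_xyz = c[i] == 'xyz'
        if b[i + 1] == 1 and prev_is_xyz and not a[i + 1]:
            c.append('Minus')
        elif b[i + 1] == 1 and prev_is_xyz and a[i + 1]:
            c.append('Plus')
        elif d[i] or (b[i + 1] == 0 and prev_is_xyz):
            c.append('xyz')
        elif a[i + 1] and prev_is_xyz:
            c.append('Plus')
        elif not a[i + 1] and prev_is_xyz:
            c.append('Minus')
        else:
            c.append(c[i])  # carry the previous result forward
    return c

def naive_select(a, b):
    """Hypothetical vectorized attempt: assumes the previous C is always 'xyz'."""
    is_one = b[1:] == 1
    rest = np.select([is_one & ~a[1:], is_one & a[1:]], ['Minus', 'Plus'], default='xyz')
    return ['xyz'] + rest.tolist()

expected = sequential(a, b, d)
naive = naive_select(a, b)
# Rows where the vectorized guess disagrees with the sequential result.
mismatches = [i for i, (x, y) in enumerate(zip(expected, naive)) if x != y]
print(expected)
print(mismatches)
```

On the sample data the two results differ at rows 4 and 7, which are exactly the rows where the previous value of C was not 'xyz'. Any correct solution has to walk the rows in order, which is why a compiled loop is a better fit than np.where or np.select here.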

import numpy as np
import numba as nb

@nb.njit('UnicodeCharSeq(8)[:](bool_[:], int64[:], bool_[:])')
def compute(a, b, d):
    n = a.size
    c = np.empty(n, dtype='U8')
    c[0] = 'xyz'
    for i in range(0, n-1):
        prev_is_xyz = c[i] == 'xyz'
        if b[i+1]==1 and prev_is_xyz and not a[i+1]:
            c[i+1] = 'Minus'
        elif b[i+1]==1 and prev_is_xyz and a[i+1]:
            c[i+1] = 'Plus'
        elif d[i] or (b[i+1]==0 and prev_is_xyz):
            c[i+1] = 'xyz'
        elif a[i+1] and prev_is_xyz:
            c[i+1] = 'Plus'
        elif not a[i+1] and prev_is_xyz:
            c[i+1] = 'Minus'
        else:
            c[i+1] = c[i]
    return c

# Convert the dataframe columns to fast NumPy arrays and precompute the string checks
a = df['A'].values.astype('U8') == 'Yes'
b = df['B'].values.astype(np.int64)
d = df['D'].values.astype('U8') != 'xyz'

# Compute the result very quickly with Numba
c = compute(a, b, d)

# Store the result back
df['C'].values[:] = c.astype(object)

Here is the resulting performance on my machine:

Basic Pandas loops:    2510 us
This Numba code:         20 us

Thus, the Numba implementation is about 125 times faster. In fact, most of the time is spent in the NumPy conversion code and not even in compute. The gap should be even bigger on large dataframes.

Note that the line df['C'].values[:] = c.astype(object) is much faster than the equivalent expression df['C'] = c (about 16 times faster).
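To check on your own machine how much the conversion step costs, a small timing harness along these lines may help. This is only a sketch: convert and best_time_us are hypothetical helper names, to_numpy() is used here in place of .values, and the numbers will vary with hardware and library versions.

```python
import timeit
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': ['Yes', 'Yes', 'No', 'No', 'Yes', 'No', 'Yes', 'Yes', 'Yes', 'Yes'],
    'B': [1, 1, 0, 1, 1, 0, 1, 0, 0, 1],
    'D': ['xyz', 'Yes', 'No', 'xyz', 'Yes', 'No', 'xyz', 'Yes', 'Yes', 'Yes'],
})

def convert(df):
    """Conversion step only: typed arrays plus precomputed string checks."""
    a = df['A'].to_numpy().astype('U8') == 'Yes'
    b = df['B'].to_numpy().astype(np.int64)
    d = df['D'].to_numpy().astype('U8') != 'xyz'
    return a, b, d

def best_time_us(fn, number=1000, repeat=5):
    """Best-of-`repeat` average time per call, in microseconds."""
    return min(timeit.repeat(fn, number=number, repeat=repeat)) / number * 1e6

conversion_us = best_time_us(lambda: convert(df))
print(f"conversion: {conversion_us:.1f} us per call")
```

The same best_time_us helper can be wrapped around the call to compute, or around the original loop, to compare the stages under identical conditions.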
