How to eliminate a for loop in a Pandas DataFrame when filling each row of a column based on multiple if/elif statements
I am trying to get rid of a for loop to speed up filling column 'C', whose values depend on if/elif conditions involving multiple columns and rows. I have not been able to find a proper solution.
I tried applying np.where with conditions, choices and default values, but failed to get the expected results because I was unable to extract individual values from the pandas Series object.
import pandas as pd

df = pd.DataFrame()
df['A'] = ['Yes', 'Yes', 'No', 'No', 'Yes', 'No', 'Yes', 'Yes', 'Yes', 'Yes']
df['B'] = [1, 1, 0, 1, 1, 0, 1, 0, 0, 1]
df['C'] = None
df['D'] = ['xyz', 'Yes', 'No', 'xyz', 'Yes', 'No', 'xyz', 'Yes', 'Yes', 'Yes']
# Seed the first row; .loc avoids chained assignment (df['C'][0] = ...)
df.loc[0, 'C'] = 'xyz'

# Column positions: 0 = 'A', 1 = 'B', 2 = 'C', 3 = 'D'
# Row i+1 of 'C' depends on row i of 'C', so rows must be filled in order
for i in range(0, len(df) - 1):
    if (df.iloc[1+i, 1] == 1) and (df.iloc[i, 2] == "xyz") and (df.iloc[1+i, 0] == "No"):
        df.iloc[1+i, 2] = "Minus"
    elif (df.iloc[1+i, 1] == 1) and (df.iloc[i, 2] == "xyz") and (df.iloc[1+i, 0] == "Yes"):
        df.iloc[1+i, 2] = "Plus"
    elif (df.iloc[i, 3] != "xyz") or ((df.iloc[1+i, 1] == 0) and (df.iloc[i, 2] == "xyz")):
        df.iloc[1+i, 2] = "xyz"
    elif (df.iloc[1+i, 0] == "Yes") and (df.iloc[i, 2] == "xyz"):
        df.iloc[1+i, 2] = "Plus"
    elif (df.iloc[1+i, 0] == "No") and (df.iloc[i, 2] == "xyz"):
        df.iloc[1+i, 2] = "Minus"
    else:
        df.iloc[1+i, 2] = df.iloc[i, 2]

print(df)
I would appreciate help from the community in rewriting the above code so that it runs faster, preferably with NumPy vectorization.
The loop certainly cannot be efficiently vectorized using NumPy or Pandas, because there is a loop-carried data dependency on df['C']: each value depends on the one computed for the previous row. The loop is very slow because of Pandas direct indexing and string comparisons. Fortunately, you can use Numba to solve this problem efficiently. You first need to convert the columns into strongly-typed NumPy arrays so that Numba can be useful. Note that Numba is pretty slow at dealing with strings, so it is better to perform the vectorized checks directly with NumPy.
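To make the loop-carried dependency concrete, here is a small illustrative sketch (not part of the original answer) of what happens if the conditions are evaluated in one shot with np.select. It assumes that only C[0] is known up front, marks rows whose previous C value is unknown with a hypothetical '?' placeholder, and compares against the values the question's loop produces for rows 1 to 9, which were worked out by hand from that loop.

```python
import numpy as np

# Example columns from the question, reduced to booleans/integers.
a = np.array(['Yes', 'Yes', 'No', 'No', 'Yes',
              'No', 'Yes', 'Yes', 'Yes', 'Yes']) == 'Yes'
b = np.array([1, 1, 0, 1, 1, 0, 1, 0, 0, 1])
d = np.array(['xyz', 'Yes', 'No', 'xyz', 'Yes',
              'No', 'xyz', 'Yes', 'Yes', 'Yes']) != 'xyz'

# Before the loop runs, only C[0] == 'xyz' is known, so the test
# "previous C is 'xyz'" can only be answered for the very first step.
prev_is_xyz = np.zeros(a.size - 1, dtype=bool)
prev_is_xyz[0] = True

next_a, next_b, cur_d = a[1:], b[1:], d[:-1]
conditions = [
    (next_b == 1) & prev_is_xyz & ~next_a,
    (next_b == 1) & prev_is_xyz & next_a,
    cur_d | ((next_b == 0) & prev_is_xyz),
    next_a & prev_is_xyz,
    ~next_a & prev_is_xyz,
]
choices = ['Minus', 'Plus', 'xyz', 'Plus', 'Minus']

# '?' is a placeholder (an assumption of this sketch) for rows whose
# result would need a previous C value that is not available yet.
naive = np.select(conditions, choices, default='?')

# Rows 1..9 of C as produced by the sequential loop in the question
# (derived by hand from that loop).
sequential = np.array(['Plus', 'xyz', 'Minus', 'Minus', 'xyz',
                       'Plus', 'Plus', 'xyz', 'Plus'])

mismatches = int((naive != sequential).sum())
print(naive)
print("rows that differ from the sequential loop:", mismatches)
```

The one-shot version gets the first step right and then diverges, because every later row needs a C value that only exists once the earlier rows have been processed in order.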
Here is the resulting Numba code:
import numpy as np
import numba as nb

@nb.njit('UnicodeCharSeq(8)[:](bool_[:], int64[:], bool_[:])')
def compute(a, b, d):
    n = a.size
    c = np.empty(n, dtype='U8')
    c[0] = 'xyz'
    for i in range(0, n-1):
        prev_is_xyz = c[i] == 'xyz'
        if b[i+1]==1 and prev_is_xyz and not a[i+1]:
            c[i+1] = 'Minus'
        elif b[i+1]==1 and prev_is_xyz and a[i+1]:
            c[i+1] = 'Plus'
        elif d[i] or (b[i+1]==0 and prev_is_xyz):
            c[i+1] = 'xyz'
        elif a[i+1] and prev_is_xyz:
            c[i+1] = 'Plus'
        elif not a[i+1] and prev_is_xyz:
            c[i+1] = 'Minus'
        else:
            c[i+1] = c[i]
    return c

# Convert the dataframe columns to fast Numpy arrays and precompute some checks
a = df['A'].values.astype('U8') == 'Yes'
b = df['B'].values.astype(np.int64)
d = df['D'].values.astype('U8') != 'xyz'

# Compute the result very quickly with Numba
c = compute(a, b, d)

# Store the result back
df['C'].values[:] = c.astype(object)
Here is the resulting performance on my machine:
Basic Pandas loops: 2510 us
This Numba code: 20 us
Thus, the Numba implementation is about 125 times faster. In fact, most of the time is spent in the NumPy conversion code and not even in compute. The gap should be even bigger on large dataframes.
Note that the line df['C'].values[:] = c.astype(object) is much faster (about 16 times) than the equivalent expression df['C'] = c.
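If Numba is not an option, part of the overhead attributed above to Pandas direct indexing can probably be avoided by running the same recurrence over plain Python lists. The following is only a sketch under that assumption: fill_c is a hypothetical helper name, it is not part of the original answer, and it has not been benchmarked here, so expect it to be slower than the Numba version.

```python
def fill_c(a, b, d):
    """Run the question's if/elif recurrence over plain Python sequences.

    a, b, d hold the raw values of columns 'A', 'B' and 'D'.
    Returns the list of values for column 'C'.
    """
    n = len(a)
    if n == 0:
        return []
    c = [None] * n
    c[0] = 'xyz'  # same seed value as in the question
    for i in range(n - 1):
        prev = c[i]
        prev_is_xyz = prev == 'xyz'
        if b[i + 1] == 1 and prev_is_xyz and a[i + 1] == 'No':
            c[i + 1] = 'Minus'
        elif b[i + 1] == 1 and prev_is_xyz and a[i + 1] == 'Yes':
            c[i + 1] = 'Plus'
        elif d[i] != 'xyz' or (b[i + 1] == 0 and prev_is_xyz):
            c[i + 1] = 'xyz'
        elif a[i + 1] == 'Yes' and prev_is_xyz:
            c[i + 1] = 'Plus'
        elif a[i + 1] == 'No' and prev_is_xyz:
            c[i + 1] = 'Minus'
        else:
            c[i + 1] = prev  # carry the previous value forward
    return c


# Example data from the question, as plain lists.
A = ['Yes', 'Yes', 'No', 'No', 'Yes', 'No', 'Yes', 'Yes', 'Yes', 'Yes']
B = [1, 1, 0, 1, 1, 0, 1, 0, 0, 1]
D = ['xyz', 'Yes', 'No', 'xyz', 'Yes', 'No', 'xyz', 'Yes', 'Yes', 'Yes']

result = fill_c(A, B, D)
print(result)
```

With a dataframe, the lists could be obtained with df['A'].tolist(), df['B'].tolist() and df['D'].tolist(), and the result stored back with df['C'] = result.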