Fastest way to update DataFrame rows without a loop

Question

Setting up the scenario:

Suppose a DataFrame has two Series, where A is the input and B is the result of A[index]*2:

import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3],
                   'B': [2, 4, 6]})

Now suppose I receive a DataFrame with 100k rows and search it for errors (here the B value 0 is invalid):

df = pd.DataFrame({'A': [1, 2, 3],
                   'B': [2, 0, 6]})

I find the invalid rows with:

invalid_rows = df.loc[df['A']*2 != df['B']]

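I now have invalid_rows, but I am not sure what the fastest way is to overwrite the invalid rows in the original df with the result of A[index]*2.

Iterating over df with iterrows() is one option, but it gets slow as df grows. Can I use df.update() for this somehow?

Working solution with a loop: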

for row_index, my_series in df.iterrows():
    # Overwrite B only where it does not match A * 2
    if my_series['A'] * 2 != my_series['B']:
        df.loc[row_index, 'B'] = my_series['A'] * 2

But is there a faster way to do this?

Answer 1

Use mul, ne, and loc:

m = df['A'].mul(2).ne(df['B'])
# same as: m = df['A'] * 2 != df['B']
df.loc[m, 'B'] = df['A'].mul(2)

   A  B
0  1  2
1  2  4
2  3  6
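
As a self-contained check, the sketch below rebuilds the example frame from the question and applies the same masked assignment. The right-hand side is the full-length Series; pandas aligns it on the index, so only the rows selected by the mask are written. The names expected, fixed, and alt are illustrative, and the Series.mask line is an alternative spelling that is not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3],
                   'B': [2, 0, 6]})

expected = df['A'] * 2      # what B should contain
m = expected != df['B']     # True where B is wrong

# Masked assignment: only rows where m is True are overwritten.
fixed = df.copy()
fixed.loc[m, 'B'] = expected

# Alternative (assumption, not from the answer): Series.mask replaces
# values where the condition is True and keeps the rest.
alt = df.copy()
alt['B'] = alt['B'].mask(m, expected)
```

Both variants avoid a Python-level loop, which is the point of the approach when the frame grows to 100k rows.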

m is a boolean Series that marks the rows where A * 2 != B:

print(m)

0    False
1     True
2    False
dtype: bool
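
On the df.update() part of the question: a minimal sketch of one way it could be used, assuming the same example frame. DataFrame.update modifies the frame in place with the non-NA values of another frame, aligned on index and columns, so a small frame holding only the corrected rows is enough. The name corrections is illustrative and this variant is not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3],
                   'B': [2, 0, 6]})

m = df['A'].mul(2).ne(df['B'])

# Frame with only the corrected rows; it keeps the original index labels,
# which is what update() aligns on.
corrections = pd.DataFrame({'B': df.loc[m, 'A'] * 2})

# In-place update of the matching cells. Depending on the pandas version,
# the dtype of B may change to float, so check it if that matters.
df.update(corrections)
```

For this task the loc assignment is the more direct option; update() is mainly useful when the corrections already arrive as a separate frame.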