
Jupyter Kernel Dies when I use .apply

I have a very large pandas dataframe (a few million rows) that I am manipulating. The last column I calculate uses the following code:

df['diff'] = df.apply(lambda row: row.col_a - row.col_b, axis=1)

It is fifty-fifty whether the code runs at all, and when it does, it takes the better part of an hour. Is there a way in pandas to run this more efficiently? I've started to do some research; I looked at this Stack Overflow page (Why is pandas apply lambda slower than loop here?), but it is for categorical data. I've also done some research on vectorized operations, but haven't found anything that I think will work. Any help is appreciated.

Your row-by-row way of calculating is more than 5,000 times slower than a vectorized operation on a dataframe of random integers of shape (10000, 4). Avoid the combination of lambda and axis=1 if at all possible, and vectorize instead.

import pandas as pd
import numpy as np

# 10,000 rows of random integers in four columns
df = pd.DataFrame(np.random.randint(0, 100, size=(10000, 4)), columns=list('ABCD'))

%timeit df['E'] = df['A'] - df['B']                      # vectorized column arithmetic
%timeit df['E'] = df.apply(lambda x: x.A - x.B, axis=1)  # row-by-row apply
df

485 µs ± 10.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
2.48 s ± 45.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) # > 5000x slower
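
Applied to the question's dataframe, the same vectorized pattern replaces the apply() call directly (a minimal sketch, assuming col_a and col_b are numeric columns as in the question):

# Vectorized equivalent of df.apply(lambda row: row.col_a - row.col_b, axis=1)
df['diff'] = df['col_a'] - df['col_b']

This performs the subtraction as a single NumPy operation over the whole columns instead of invoking a Python lambda once per row, which is what makes apply with axis=1 so slow on a few million rows.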
