
Fastest way to multiply 2 Pandas columns with each other and get the sum of the values

I am doing a lot of calculations in which I multiply one pandas column named "factor" by another called "value", and then calculate the sum of the products.

Both columns are usually around 200 rows long. Since I run this calculation thousands of times in my current project, I need it to be as fast as possible.

A scaled-down version of the code looks like this (only 4 rows):

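import pandas as pd

# Named "data" rather than "dict" to avoid shadowing the built-in
data = {'factor': [0.25, 0.25, 0.25, 0.25],
        'value': [22000, 25000, 27000, 35000]}

df = pd.DataFrame(data, columns=['factor', 'value'])

# Multiply the two columns element-wise, then sum the products
print((df['factor'] * df['value']).sum())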

This prints 27250.0.

Is there a way to get the same result faster?

You can use NumPy: convert both columns to 1-D arrays with .values, multiply the arrays, and then call numpy.sum. This skips building an intermediate pandas Series:

import numpy as np
import pandas as pd

np.random.seed(456)

# 200 rows, matching the typical column length in the question
d = {'factor': np.random.rand(200),
     'value': np.random.randint(1000, size=200)}

df = pd.DataFrame(d, columns=['factor', 'value'])

In [139]: %timeit ((df['factor'] * df['value']).sum())
245 µs ± 2.64 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [140]: %timeit (np.sum((df['factor'].values * df['value'].values)))
20.6 µs ± 328 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
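
To check that the faster expression returns the same number as the pandas one, a minimal sketch follows. It rebuilds the same random data, uses .to_numpy() (the spelling the pandas documentation now recommends over .values), and also tries np.dot, which multiplies and sums in one call. The np.dot variant and the variable names are additions for illustration, not part of the original answer, and no timing is claimed for them here.

```python
import numpy as np
import pandas as pd

# Same synthetic data as in the timings above
np.random.seed(456)
df = pd.DataFrame({'factor': np.random.rand(200),
                   'value': np.random.randint(1000, size=200)})

# Pull out the underlying 1-D arrays once
factor = df['factor'].to_numpy()
value = df['value'].to_numpy()

# Three ways to compute the sum of the element-wise products
pandas_total = (df['factor'] * df['value']).sum()
numpy_total = np.sum(factor * value)
dot_total = np.dot(factor, value)  # illustrative alternative (assumption)

# The results agree up to floating-point rounding
print(np.isclose(pandas_total, numpy_total))
print(np.isclose(numpy_total, dot_total))
```

The three totals can differ in the last few bits because the additions happen in a different order, so compare them with np.isclose rather than ==.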

If the columns can contain missing values, numpy.sum will return NaN (unlike the pandas sum, which skips NaN by default), so use numpy.nansum instead to ignore them:

np.random.seed(456)

d = {'factor': np.random.rand(200),
     'value': np.random.randint(1000, size=200)}

df = pd.DataFrame(d, columns=['factor', 'value'])
# Replace values above 700 with NaN to simulate missing data
df['value'] = df['value'].mask(df['value'] > 700)

In [144]: %timeit ((df['factor'] * df['value']).sum())
235 µs ± 8.65 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [145]: %timeit (np.nansum((df['factor'].values * df['value'].values)))
33.3 µs ± 1.28 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
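
The difference between the three sums is easiest to see on a tiny frame. The sketch below reuses the 4-row data from the question with one value replaced by NaN; that replacement and the variable names are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# The question's 4-row example, with one value set to NaN (assumption)
df = pd.DataFrame({'factor': [0.25, 0.25, 0.25, 0.25],
                   'value': [22000, np.nan, 27000, 35000]})

# Element-wise products; the second entry is NaN
products = df['factor'].to_numpy() * df['value'].to_numpy()

pandas_total = (df['factor'] * df['value']).sum()  # skipna=True by default
plain_total = np.sum(products)                     # NaN propagates
nan_safe_total = np.nansum(products)               # NaN treated as zero

print(pandas_total)    # 21000.0
print(plain_total)     # nan
print(nan_safe_total)  # 21000.0
```

So np.nansum is the NumPy call that matches the default pandas behaviour; np.sum is only a safe replacement when the columns are known to have no missing values.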
