
How to speed up Pandas multilevel dataframe sum?

I am trying to speed up the sum of several big multilevel dataframes.

Here is a sample:

df1 = mul_df(5000,30,400) # mul_df creates a big multilevel dataframe
# let df2, df3, df4 = df1, df1, df1 to minimize memory usage;
# they could also each be mul_df(5000,30,400)
df2, df3, df4 = df1, df1, df1

In [12]: timeit df1+df2+df3+df4
1 loops, best of 3: 993 ms per loop

I am not satisfied with the 993 ms. Is there any way to speed this up? Can Cython improve the performance? If yes, how should the Cython code be written? Thanks.

Note: mul_df() is the function that creates the demo multilevel dataframe.

import itertools
import numpy as np
import pandas as pd

def mul_df(level1_rownum, level2_rownum, col_num, data_ty='float32'):
    ''' create multilevel dataframe, for example: mul_df(4,2,6)'''

    index_name = ['STK_ID','RPT_Date']
    col_name = ['COL'+str(x).zfill(3) for x in range(col_num)]

    first_level_dt = [['A'+str(x).zfill(4)]*level2_rownum for x in range(level1_rownum)]
    first_level_dt = list(itertools.chain(*first_level_dt)) #flatten the list
    second_level_dt = ['B'+str(x).zfill(3) for x in range(level2_rownum)]*level1_rownum

    dt = pd.DataFrame(np.random.randn(level1_rownum*level2_rownum, col_num), columns=col_name, dtype = data_ty)
    dt[index_name[0]] = first_level_dt
    dt[index_name[1]] = second_level_dt

    rst = dt.set_index(index_name, drop=True, inplace=False)
    return rst

Update:

Data on my Pentium Dual-Core T4200 @ 2.00 GHz, 3.00 GB RAM, Windows XP, Python 2.7.4, NumPy 1.7.1, Pandas 0.11.0, numexpr 2.0.1 (Anaconda 1.5.0, 32-bit):

In [1]: from pandas.core import expressions as expr
In [2]: import numexpr as ne

In [3]: df1 = mul_df(5000,30,400)
In [4]: df2, df3, df4 = df1, df1, df1

In [5]: expr.set_use_numexpr(False)
In [6]: %timeit df1+df2+df3+df4
1 loops, best of 3: 1.06 s per loop

In [7]: expr.set_use_numexpr(True)
In [8]: %timeit df1+df2+df3+df4
1 loops, best of 3: 986 ms per loop

In [9]: %timeit  DataFrame(ne.evaluate('df1+df2+df3+df4'),columns=df1.columns,index=df1.index,dtype='float32')
1 loops, best of 3: 388 ms per loop

method 1: On my machine, not so bad (with numexpr disabled)

In [41]: from pandas.core import expressions as expr

In [42]: expr.set_use_numexpr(False)

In [43]: %timeit df1+df2+df3+df4
1 loops, best of 3: 349 ms per loop

method 2: Using numexpr (enabled by default if numexpr is installed)

In [44]: expr.set_use_numexpr(True)

In [45]: %timeit df1+df2+df3+df4
10 loops, best of 3: 173 ms per loop

method 3: Direct use of numexpr

In [34]: import numexpr as ne

In [46]: %timeit  DataFrame(ne.evaluate('df1+df2+df3+df4'),columns=df1.columns,index=df1.index,dtype='float32')
10 loops, best of 3: 47.7 ms per loop

These speedups are achieved using numexpr because it:

  • avoids using intermediate temporary arrays (which in the case you present is probably quite inefficient in numpy; I suspect this is being evaluated like ((df1+df2)+df3)+df4)
  • uses multiple cores as available
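The temporary-array point can be illustrated with plain NumPy (a small sketch, independent of pandas and numexpr): chaining `+` allocates a fresh array at every step, whereas accumulating with `np.add(..., out=...)` reuses a single buffer.

```python
import numpy as np

a, b, c, d = (np.random.randn(1000, 100).astype('float32') for _ in range(4))

# Chained addition: ((a + b) + c) + d allocates two intermediate temporaries.
chained = a + b + c + d

# Accumulating in place allocates only the single output buffer.
acc = a + b               # one allocation
np.add(acc, c, out=acc)   # reuses the buffer
np.add(acc, d, out=acc)

assert np.allclose(chained, acc)
```

numexpr goes one step further: it compiles the whole expression and streams it through the CPU cache in blocks, so no full-size temporary is materialized at all.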

As I hinted above, pandas uses numexpr under the hood for certain types of ops (in 0.11), e.g. df1 + df2 would be evaluated this way; however, the example you give here will result in several calls to numexpr (this is why method 2 is faster than method 1). Using ne.evaluate(...) directly (method 3) achieves even more speedup.

Note that in pandas 0.13 (0.12 will be released this week), we have implemented a function pd.eval which will in effect do exactly what my example above does. Stay tuned (if you are adventurous, this will be in master somewhat soon: https://github.com/pydata/pandas/pull/4037).

In [5]: %timeit pd.eval('df1+df2+df3+df4')
10 loops, best of 3: 50.9 ms per loop
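In released pandas versions, `pd.eval` resolves the frame names from the calling scope directly, so the pattern is a one-liner (a minimal sketch with small throwaway frames; timings will of course differ from the ones above):

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.random.randn(100, 10).astype('float32'))
df2, df3, df4 = df1.copy(), df1.copy(), df1.copy()

# pd.eval looks up df1..df4 in the local scope; it uses the numexpr
# engine when numexpr is installed and falls back to pure Python otherwise.
result = pd.eval('df1 + df2 + df3 + df4')

assert np.allclose(result.values, (df1 + df2 + df3 + df4).values)
```

Unlike the raw `ne.evaluate(...)` call, `pd.eval` returns a DataFrame with the index and columns already attached.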

Lastly, to answer your question: Cython will not help here at all; numexpr is quite efficient at this type of problem (that said, there are situations where Cython is helpful).

One caveat: in order to use the direct numexpr method, the frames should already be aligned (numexpr operates on the numpy array and does not know anything about the indices). They should also have a single dtype.
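It can be worth asserting those preconditions explicitly before taking the direct-numexpr shortcut. A small pandas-only helper (the name `frames_compatible` is hypothetical, not a pandas API) might look like:

```python
import numpy as np
import pandas as pd

def frames_compatible(*frames):
    """True if all frames share index, columns and a single dtype,
    so their underlying arrays can safely be combined positionally."""
    first = frames[0]
    if first.dtypes.nunique() != 1:   # must be a single dtype
        return False
    return all(
        f.index.equals(first.index)
        and f.columns.equals(first.columns)
        and f.dtypes.equals(first.dtypes)
        for f in frames[1:]
    )

df1 = pd.DataFrame(np.random.randn(6, 3).astype('float32'))
df2 = df1.copy()
df3 = df1.sort_index(ascending=False)  # same labels, different row order

assert frames_compatible(df1, df2)
assert not frames_compatible(df1, df3)
```

Note that `Index.equals` is order-sensitive, which is exactly what positional array arithmetic requires.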

Other Observations

  • You cannot expect more speedup if you have only 2 cores on your machine. In the end, numexpr relies on parallelisation and on performant usage of the CPU cache.
  • What you are doing is, to some extent, wrong. numexpr evaluations on DataFrames are fast, but wrong: they do not return the right result if the DataFrames are not equally indexed. Different sorting will already trouble you, as I show below.
  • If you add DataFrames with different indexes, the whole thing is not that performant anymore. Pandas does quite a good job of adding the proper rows for you by looking up the corresponding index entries, but this comes at a natural cost.

My observations follow. First, I reproduce your test case and come to other results: using numexpr under the hood of pandas increases performance significantly. Second, I sort one of the four DataFrames in descending order and rerun all cases: performance breaks down, and additionally (as expected) numexpr evaluation on pandas DataFrames leads to wrong results.

Equal Indices on all Frames

This case reproduces yours. The only difference is that I create copies of the initial DataFrame instance, so nothing is shared; different objects (ids) are in use to make sure that numexpr can deal with them.

import itertools
import numpy as np
import pandas as pd

def mul_df(level1_rownum, level2_rownum, col_num, data_ty='float32'):
    ''' create multilevel dataframe, for example: mul_df(4,2,6)'''

    index_name = ['STK_ID','RPT_Date']
    col_name = ['COL'+str(x).zfill(3) for x in range(col_num)]

    first_level_dt = [['A'+str(x).zfill(4)]*level2_rownum for x in range(level1_rownum)]
    first_level_dt = list(itertools.chain(*first_level_dt)) #flatten the list
    second_level_dt = ['B'+str(x).zfill(3) for x in range(level2_rownum)]*level1_rownum

    dt = pd.DataFrame(np.random.randn(level1_rownum*level2_rownum, col_num), columns=col_name, dtype = data_ty)
    dt[index_name[0]] = first_level_dt
    dt[index_name[1]] = second_level_dt

    rst = dt.set_index(index_name, drop=True, inplace=False)
    return rst
df1 = mul_df(5000,30,400)
df2, df3, df4 = df1.copy(), df1.copy(), df1.copy()

pd.options.compute.use_numexpr = False

%%timeit
df1 + df2 + df3 + df4
564 ms ± 10.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

pd.options.compute.use_numexpr = True

%%timeit
df1 + df2 + df3 + df4
152 ms ± 1.47 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

import numexpr as ne

%%timeit
pd.DataFrame(ne.evaluate('df1 + df2 + df3 + df4'), columns=df1.columns, index=df1.index, dtype='float32')
66.4 ms ± 1.16 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

(df1 + df2 + df3 + df4).equals(pd.DataFrame(ne.evaluate('df1 + df2 + df3 + df4'), columns=df1.columns, index=df1.index, dtype='float32'))
True

(Slightly) Different Indices on some Frames

Here I sort one of the DataFrames in descending order, thereby changing the index and reshuffling the rows in the dataframe's internal numpy array.

(Same imports and mul_df definition as above.)
df1 = mul_df(5000,30,400)
df2, df3, df4 = df1.copy(), df1.copy(), df1.copy().sort_index(ascending=False)

pd.options.compute.use_numexpr = False

%%timeit
df1 + df2 + df3 + df4
1.36 s ± 67.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

pd.options.compute.use_numexpr = True

%%timeit
df1 + df2 + df3 + df4
928 ms ± 39.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

import numexpr as ne

%%timeit
pd.DataFrame(ne.evaluate('df1 + df2 + df3 + df4'), columns=df1.columns, index=df1.index, dtype='float32')
68 ms ± 2.2 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

(df1 + df2 + df3 + df4).equals(pd.DataFrame(ne.evaluate('df1 + df2 + df3 + df4'), columns=df1.columns, index=df1.index, dtype='float32'))
False
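Re-aligning the differently-sorted frame before dropping to the raw arrays restores a correct result. A small pandas-only sketch (plain NumPy addition on `.values` stands in for `ne.evaluate`, which has the same positional semantics):

```python
import numpy as np
import pandas as pd

idx = pd.Index(list('abcdef'))
df1 = pd.DataFrame(np.random.randn(6, 3).astype('float32'), index=idx)
df4 = df1.sort_index(ascending=False)  # same labels, rows reshuffled

expected = df1 + df4  # pandas aligns on labels: correct

# Positional addition on the raw arrays ignores labels: wrong here.
wrong = pd.DataFrame(df1.values + df4.values,
                     index=df1.index, columns=df1.columns)

# Reindexing df4 into df1's label order first makes positional addition safe.
fixed = pd.DataFrame(df1.values + df4.reindex(df1.index).values,
                     index=df1.index, columns=df1.columns)

assert not wrong.equals(expected)
assert fixed.equals(expected)
```

The reindex step is, of course, exactly the alignment work pandas performs for you, so it is only worth paying once up front when the aligned frames are reused across many expressions.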

Conclusions

By using numexpr:

  • Quite some speedup is gained when operating on equally indexed DataFrames.
  • The same is true for other expressions involving a single dataframe, such as 2 * df1.
  • This is not the case for operations between DataFrames with different indices.
  • It even leads to completely wrong results if one evaluates expressions containing pandas DataFrames directly with numexpr. By chance they can be right, but numexpr is made for optimizing expressions on numpy arrays.
