How to speed up Pandas multilevel dataframe sum?
I am trying to speed up the sum of several big multilevel dataframes.
Here is a sample:
df1 = mul_df(5000,30,400) # mul_df to create a big multilevel dataframe
#let df2, df3, df4 = df1, df1, df1 to minimize the memory usage,
#they can also be mul_df(5000,30,400)
df2, df3, df4 = df1, df1, df1
In [12]: timeit df1+df2+df3+df4
1 loops, best of 3: 993 ms per loop
I am not satisfied with the 993 ms. Is there any way to speed it up? Can Cython improve the performance?
If yes, how should the Cython code be written?
Thanks.
Note: mul_df() is the function that creates the demo multilevel dataframe:
import itertools
import numpy as np
import pandas as pd
def mul_df(level1_rownum, level2_rownum, col_num, data_ty='float32'):
''' create multilevel dataframe, for example: mul_df(4,2,6)'''
index_name = ['STK_ID','RPT_Date']
col_name = ['COL'+str(x).zfill(3) for x in range(col_num)]
first_level_dt = [['A'+str(x).zfill(4)]*level2_rownum for x in range(level1_rownum)]
first_level_dt = list(itertools.chain(*first_level_dt)) #flatten the list
second_level_dt = ['B'+str(x).zfill(3) for x in range(level2_rownum)]*level1_rownum
dt = pd.DataFrame(np.random.randn(level1_rownum*level2_rownum, col_num), columns=col_name, dtype = data_ty)
dt[index_name[0]] = first_level_dt
dt[index_name[1]] = second_level_dt
rst = dt.set_index(index_name, drop=True, inplace=False)
return rst
Update:
Data on my Pentium Dual-Core T4200 @ 2.00 GHz, 3.00 GB RAM, Windows XP, Python 2.7.4, NumPy 1.7.1, Pandas 0.11.0, numexpr 2.0.1 (Anaconda 1.5.0, 32-bit):
In [1]: from pandas.core import expressions as expr
In [2]: import numexpr as ne
In [3]: df1 = mul_df(5000,30,400)
In [4]: df2, df3, df4 = df1, df1, df1
In [5]: expr.set_use_numexpr(False)
In [6]: %timeit df1+df2+df3+df4
1 loops, best of 3: 1.06 s per loop
In [7]: expr.set_use_numexpr(True)
In [8]: %timeit df1+df2+df3+df4
1 loops, best of 3: 986 ms per loop
In [9]: %timeit DataFrame(ne.evaluate('df1+df2+df3+df4'),columns=df1.columns,index=df1.index,dtype='float32')
1 loops, best of 3: 388 ms per loop
method 1: On my machine, not so bad (with numexpr disabled):
In [41]: from pandas.core import expressions as expr
In [42]: expr.set_use_numexpr(False)
In [43]: %timeit df1+df2+df3+df4
1 loops, best of 3: 349 ms per loop
method 2: Using numexpr (which is enabled by default if numexpr is installed):
In [44]: expr.set_use_numexpr(True)
In [45]: %timeit df1+df2+df3+df4
10 loops, best of 3: 173 ms per loop
method 3: Direct use of numexpr:
In [34]: import numexpr as ne
In [46]: %timeit DataFrame(ne.evaluate('df1+df2+df3+df4'),columns=df1.columns,index=df1.index,dtype='float32')
10 loops, best of 3: 47.7 ms per loop
These speedups are achieved with numexpr. As I hinted above, pandas uses numexpr under the hood for certain types of ops (in 0.11), e.g. df1 + df2 would be evaluated this way. However, the example you are giving here results in several calls to numexpr, one per binary operation, since it is evaluated as ((df1+df2)+df3)+df4; this is why method 2 is faster than method 1. Using ne.evaluate(...) directly (method 3) achieves even more speedup, because the whole expression is evaluated in a single call.
Note that in pandas 0.13 (0.12 will be released this week), we have implemented a function pd.eval which will in effect do exactly what my example above does. Stay tuned (if you are adventurous, this will be in master somewhat soon: https://github.com/pydata/pandas/pull/4037 ).
In [5]: %timeit pd.eval('df1+df2+df3+df4')
10 loops, best of 3: 50.9 ms per loop
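For completeness, a minimal self-contained sketch of the pd.eval call (small hypothetical frames standing in for the 5000x30x400 case from the question):

```python
import numpy as np
import pandas as pd

# Small hypothetical frames in place of the big multilevel ones.
df1 = pd.DataFrame(np.ones((4, 3), dtype='float32'), columns=list('abc'))
df2, df3, df4 = df1.copy(), df1.copy(), df1.copy()

# pd.eval parses the whole expression and evaluates it in one pass
# (via numexpr when it is installed), preserving index and columns.
result = pd.eval('df1 + df2 + df3 + df4')

print((result.values == 4.0).all())  # every cell is 1+1+1+1
```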
Lastly, to answer your question: cython will not help here at all; numexpr is quite efficient at this type of problem (that said, there are situations where cython is helpful).
One caveat: in order to use the direct numexpr method, the frames should already be aligned (numexpr operates on the underlying numpy arrays and doesn't know anything about the indices). They should also be of a single dtype.
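To illustrate the caveat, here is a minimal sketch of pre-aligning a frame before handing the raw arrays to numexpr (tiny hypothetical frames `a` and `b`, not the question's mul_df case):

```python
import numpy as np
import pandas as pd
import numexpr as ne

# Two frames with the same data but different row order.
a = pd.DataFrame(np.arange(8, dtype='float32').reshape(4, 2),
                 index=list('wxyz'), columns=['c0', 'c1'])
b = a.iloc[::-1]  # reversed row order

# Reindex b to a's index first; after that, numexpr's purely
# positional addition on the raw arrays is safe.
b_aligned = b.reindex(a.index)
res = pd.DataFrame(ne.evaluate('x + y',
                               local_dict={'x': a.values,
                                           'y': b_aligned.values}),
                   index=a.index, columns=a.columns, dtype='float32')

print(res.equals(a + b))  # same result as the index-aligned Pandas sum
```

Without the reindex step, the reversed rows of `b` would be added to the wrong rows of `a`.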
My observations follow. First, I reproduce your test case and come to different results: using numexpr under the hood of Pandas increases performance significantly. Second, I sort one of the four DataFrames in descending order and rerun all cases. The performance breaks down, and additionally, as expected, numexpr evaluation on the Pandas DataFrames leads to wrong results.
This case reproduces yours. The only difference is that I create copies of the initial DataFrame instance, so nothing is shared; distinct objects (ids) are in use, to make sure that numexpr can deal with them.
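The difference from the question's setup (where all four names point at one object) can be sketched with a tiny hypothetical frame:

```python
import pandas as pd

df1 = pd.DataFrame({'x': [1.0, 2.0]})

shared = df1          # the question's setup: several names, one object
copied = df1.copy()   # this test case: an independent copy, nothing shared

print(shared is df1, copied is df1)  # True False
```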
import itertools
import numpy as np
import pandas as pd
def mul_df(level1_rownum, level2_rownum, col_num, data_ty='float32'):
''' create multilevel dataframe, for example: mul_df(4,2,6)'''
index_name = ['STK_ID','RPT_Date']
col_name = ['COL'+str(x).zfill(3) for x in range(col_num)]
first_level_dt = [['A'+str(x).zfill(4)]*level2_rownum for x in range(level1_rownum)]
first_level_dt = list(itertools.chain(*first_level_dt)) #flatten the list
second_level_dt = ['B'+str(x).zfill(3) for x in range(level2_rownum)]*level1_rownum
dt = pd.DataFrame(np.random.randn(level1_rownum*level2_rownum, col_num), columns=col_name, dtype = data_ty)
dt[index_name[0]] = first_level_dt
dt[index_name[1]] = second_level_dt
rst = dt.set_index(index_name, drop=True, inplace=False)
return rst
df1 = mul_df(5000,30,400)
df2, df3, df4 = df1.copy(), df1.copy(), df1.copy()
pd.options.compute.use_numexpr = False
%%timeit
df1 + df2 + df3 + df4
564 ms ± 10.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
pd.options.compute.use_numexpr = True
%%timeit
df1 + df2 + df3 + df4
152 ms ± 1.47 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
import numexpr as ne
%%timeit
pd.DataFrame(ne.evaluate('df1 + df2 + df3 + df4'), columns=df1.columns, index=df1.index, dtype='float32')
66.4 ms ± 1.16 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
(df1 + df2 + df3 + df4).equals(pd.DataFrame(ne.evaluate('df1 + df2 + df3 + df4'), columns=df1.columns, index=df1.index, dtype='float32'))
True
Here I sort one of the DataFrames in descending order, thereby changing the index and reshuffling the rows in the DataFrame's internal numpy array.
import itertools
import numpy as np
import pandas as pd
def mul_df(level1_rownum, level2_rownum, col_num, data_ty='float32'):
''' create multilevel dataframe, for example: mul_df(4,2,6)'''
index_name = ['STK_ID','RPT_Date']
col_name = ['COL'+str(x).zfill(3) for x in range(col_num)]
first_level_dt = [['A'+str(x).zfill(4)]*level2_rownum for x in range(level1_rownum)]
first_level_dt = list(itertools.chain(*first_level_dt)) #flatten the list
second_level_dt = ['B'+str(x).zfill(3) for x in range(level2_rownum)]*level1_rownum
dt = pd.DataFrame(np.random.randn(level1_rownum*level2_rownum, col_num), columns=col_name, dtype = data_ty)
dt[index_name[0]] = first_level_dt
dt[index_name[1]] = second_level_dt
rst = dt.set_index(index_name, drop=True, inplace=False)
return rst
df1 = mul_df(5000,30,400)
df2, df3, df4 = df1.copy(), df1.copy(), df1.copy().sort_index(ascending=False)
pd.options.compute.use_numexpr = False
%%timeit
df1 + df2 + df3 + df4
1.36 s ± 67.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
pd.options.compute.use_numexpr = True
%%timeit
df1 + df2 + df3 + df4
928 ms ± 39.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
import numexpr as ne
%%timeit
pd.DataFrame(ne.evaluate('df1 + df2 + df3 + df4'), columns=df1.columns, index=df1.index, dtype='float32')
68 ms ± 2.2 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
(df1 + df2 + df3 + df4).equals(pd.DataFrame(ne.evaluate('df1 + df2 + df3 + df4'), columns=df1.columns, index=df1.index, dtype='float32'))
False
When numexpr is used directly on these unaligned frames, the values are added positionally rather than by index, so the result no longer matches the index-aligned Pandas sum; that is why the equality check above returns False.
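The misalignment effect can be reproduced on a tiny hypothetical frame:

```python
import numpy as np
import pandas as pd
import numexpr as ne

df = pd.DataFrame({'v': [1.0, 2.0, 3.0]}, index=['a', 'b', 'c'])
df_desc = df.sort_index(ascending=False)  # same data, reversed index order

# Pandas aligns on the index, so each row is doubled: 2, 4, 6.
aligned = df + df_desc

# numexpr adds the raw arrays positionally: 1+3, 2+2, 3+1 = 4, 4, 4.
positional = pd.DataFrame(ne.evaluate('x + y',
                                      local_dict={'x': df.values,
                                                  'y': df_desc.values}),
                          index=df.index, columns=df.columns)

print(list(aligned['v']), list(positional['v']))
```

The two results agree only when the frames share the same row order, which is exactly the alignment caveat noted above.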