[英]Why is Pandas.eval() with numexpr so slow?
Test code: 测试代码:
import numpy as np
import pandas as pd
COUNT = 1000000
df = pd.DataFrame({
'y': np.random.normal(0, 1, COUNT),
'z': np.random.gamma(50, 1, COUNT),
})
%timeit df.y[(10 < df.z) & (df.z < 50)].mean()
%timeit df.y.values[(10 < df.z.values) & (df.z.values < 50)].mean()
%timeit df.eval('y[(10 < z) & (z < 50)].mean()', engine='numexpr')
The output on my machine (a fairly fast x86-64 Linux desktop with Python 3.6) is: 我的机器上的输出(使用Python 3.6的相当快的x86-64 Linux桌面)是:
17.8 ms ± 1.3 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
8.44 ms ± 502 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
46.4 ms ± 2.22 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
I understand why the second line is a bit faster (it ignores the Pandas index). 我理解为什么第二行有点快(它忽略了Pandas索引)。 But why is the
eval()
approach using numexpr
so slow? 但是为什么使用
numexpr
的eval()
方法如此之慢? Shouldn't it be faster than at least the first approach? 它不应该比至少第一种方法更快吗? The documentation sure makes it seem like it would be: https://pandas.pydata.org/pandas-docs/stable/enhancingperf.html
文档肯定会使它看起来像: https : //pandas.pydata.org/pandas-docs/stable/enhancingperf.html
From the investigation presented below, it looks like the unspectacular reason for the worse performance is "overhead". 从下面的调查结果来看,表现较差的不明显原因是“开销”。
Only a small part of the expression y[(10 < z) & (z < 50)].mean()
is done via numexpr
-module. 只有表达式
y[(10 < z) & (z < 50)].mean()
一小部分是通过numexpr
-module完成的。 numexpr
doesn't support indexing , thus we can only hope for (10 < z) & (z < 50)
to be speed-up - anything else will be mapped to pandas
-operations. numexpr
不支持索引 ,因此我们只能希望(10 < z) & (z < 50)
加速 - 其他任何东西都将映射到pandas
-operations。
However, (10 < z) & (z < 50)
is not the bottle-neck here, as can be easily seen: 然而,
(10 < z) & (z < 50)
不是瓶颈,可以很容易地看出:
%timeit df.y[(10 < df.z) & (df.z < 50)].mean() # 16.7 ms
mask=(10 < df.z) & (df.z < 50)
%timeit df.y[mask].mean() # 13.7 ms
%timeit df.y[mask] # 13.2 ms
df.y[mask]
-takes the lion's share of the running time. df.y[mask]
占据了大部分的运行时间。
We can compare the profiler output for df.y[mask]
and df.eval('y[mask]')
to see what makes the difference. 我们可以比较
df.y[mask]
和df.eval('y[mask]')
的探查器输出,看看有什么区别。
When I use the following script: 当我使用以下脚本时:
import numpy as np
import pandas as pd
COUNT = 1000000
df = pd.DataFrame({
'y': np.random.normal(0, 1, COUNT),
'z': np.random.gamma(50, 1, COUNT),
})
mask = (10 < df.z) & (df.z < 50)
df['m']=mask
for _ in range(500):
df.y[df.m]
# OR
#df.eval('y[m]', engine='numexpr')
and run it with python -m cProfile -s cumulative run.py
(or %prun -s cumulative <...>
in IPython), I can see the following profiles. 并使用
python -m cProfile -s cumulative run.py
(或IPython中的%prun -s cumulative <...>
)运行它,我可以看到以下配置文件。
For direct call of the pandas functionality: 直接调用pandas功能:
ncalls tottime percall cumtime percall filename:lineno(function)
419/1 0.013 0.000 7.228 7.228 {built-in method builtins.exec}
1 0.006 0.006 7.228 7.228 run.py:1(<module>)
500 0.005 0.000 6.589 0.013 series.py:764(__getitem__)
500 0.003 0.000 6.475 0.013 series.py:812(_get_with)
500 0.003 0.000 6.468 0.013 series.py:875(_get_values)
500 0.009 0.000 6.445 0.013 internals.py:4702(get_slice)
500 0.006 0.000 3.246 0.006 range.py:491(__getitem__)
505 3.146 0.006 3.236 0.006 base.py:2067(__getitem__)
500 3.170 0.006 3.170 0.006 internals.py:310(_slice)
635/2 0.003 0.000 0.414 0.207 <frozen importlib._bootstrap>:958(_find_and_load)
We can see that almost 100% of the time is spent in series.__getitem__
without any overhead. 我们可以看到,几乎100%的时间是在
series.__getitem__
花费的series.__getitem__
没有任何开销。
For the call via df.eval(...)
, the situation is quite different: 对于通过
df.eval(...)
的调用,情况完全不同:
ncalls tottime percall cumtime percall filename:lineno(function)
453/1 0.013 0.000 12.702 12.702 {built-in method builtins.exec}
1 0.015 0.015 12.702 12.702 run.py:1(<module>)
500 0.013 0.000 12.090 0.024 frame.py:2861(eval)
1000/500 0.025 0.000 10.319 0.021 eval.py:153(eval)
1000/500 0.007 0.000 9.247 0.018 expr.py:731(__init__)
1000/500 0.004 0.000 9.236 0.018 expr.py:754(parse)
4500/500 0.019 0.000 9.233 0.018 expr.py:307(visit)
1000/500 0.003 0.000 9.105 0.018 expr.py:323(visit_Module)
1000/500 0.002 0.000 9.102 0.018 expr.py:329(visit_Expr)
500 0.011 0.000 9.096 0.018 expr.py:461(visit_Subscript)
500 0.007 0.000 6.874 0.014 series.py:764(__getitem__)
500 0.003 0.000 6.748 0.013 series.py:812(_get_with)
500 0.004 0.000 6.742 0.013 series.py:875(_get_values)
500 0.009 0.000 6.717 0.013 internals.py:4702(get_slice)
500 0.006 0.000 3.404 0.007 range.py:491(__getitem__)
506 3.289 0.007 3.391 0.007 base.py:2067(__getitem__)
500 3.282 0.007 3.282 0.007 internals.py:310(_slice)
500 0.003 0.000 1.730 0.003 generic.py:432(_get_index_resolvers)
1000 0.014 0.000 1.725 0.002 generic.py:402(_get_axis_resolvers)
2000 0.018 0.000 1.685 0.001 base.py:1179(to_series)
1000 0.003 0.000 1.537 0.002 scope.py:21(_ensure_scope)
1000 0.014 0.000 1.534 0.002 scope.py:102(__init__)
500 0.005 0.000 1.476 0.003 scope.py:242(update)
500 0.002 0.000 1.451 0.003 inspect.py:1489(stack)
500 0.021 0.000 1.449 0.003 inspect.py:1461(getouterframes)
11000 0.062 0.000 1.415 0.000 inspect.py:1422(getframeinfo)
2000 0.008 0.000 1.276 0.001 base.py:1253(_to_embed)
2035 1.261 0.001 1.261 0.001 {method 'copy' of 'numpy.ndarray' objects}
1000 0.015 0.000 1.226 0.001 engines.py:61(evaluate)
11000 0.081 0.000 1.081 0.000 inspect.py:757(findsource)
once again about 7 seconds are spent in series.__getitem__
, but there are also about 6 seconds overhead - for example about 2 seconds in frame.py:2861(eval)
and about 2 seconds in expr.py:461(visit_Subscript)
. series.__getitem__
再次花费大约7秒series.__getitem__
,但也有大约6秒的开销 - 例如在frame.py:2861(eval)
大约2秒,在expr.py:461(visit_Subscript)
大约2秒。
I did only a superficial investigation (see more details further below), but this overhead doesn't seems to be just constant but at least linear in the number of element in the series. 我只做了一个肤浅的调查(详见下面的更多详细信息),但这个开销似乎不仅仅是常数,而是至少在系列中的元素数量上是线性的。 For example there is
method 'copy' of 'numpy.ndarray' objects
which means that data is copied (it is quite unclear, why this would be necessary per se). 例如,有
method 'copy' of 'numpy.ndarray' objects
这意味着数据被复制(很不清楚,为什么这本身就是必要的)。
My take-away from it: using pd.eval
has advantages as long as the evaluated expression can be evaluated with numexpr
alone. 我从中获取:使用
pd.eval
具有优势,只要评估的表达式可以仅使用numexpr
进行评估。 As soon as this is not the case, there might be no longer gains but losses due to quite large overhead. 一旦情况不是这样,由于相当大的开销,可能不会再有收益而是损失。
Using line_profiler
(here I use %lprun-magic (after loading it with %load_ext line_profliler
) for the function run()
which is more or less a copy from the script above) we can easily find where the time is lost in Frame.eval
: 使用
line_profiler
(这里我使用%lprun-magic(在用%load_ext line_profliler
加载之后)为函数run()
这或多或少是上面脚本的副本)我们可以很容易地找到Frame.eval
丢失时间的Frame.eval
:
%lprun -f pd.core.frame.DataFrame.eval
-f pd.core.frame.DataFrame._get_index_resolvers
-f pd.core.frame.DataFrame._get_axis_resolvers
-f pd.core.indexes.base.Index.to_series
-f pd.core.indexes.base.Index._to_embed
run()
Here we can see were the additional 10% are spent: 在这里,我们可以看到额外的10%花费:
Line # Hits Time Per Hit % Time Line Contents
==============================================================
2861 def eval(self, expr,
....
2951 10 206.0 20.6 0.0 from pandas.core.computation.eval import eval as _eval
2952
2953 10 176.0 17.6 0.0 inplace = validate_bool_kwarg(inplace, 'inplace')
2954 10 30.0 3.0 0.0 resolvers = kwargs.pop('resolvers', None)
2955 10 37.0 3.7 0.0 kwargs['level'] = kwargs.pop('level', 0) + 1
2956 10 17.0 1.7 0.0 if resolvers is None:
2957 10 235850.0 23585.0 9.0 index_resolvers = self._get_index_resolvers()
2958 10 2231.0 223.1 0.1 resolvers = dict(self.iteritems()), index_resolvers
2959 10 29.0 2.9 0.0 if 'target' not in kwargs:
2960 10 19.0 1.9 0.0 kwargs['target'] = self
2961 10 46.0 4.6 0.0 kwargs['resolvers'] = kwargs.get('resolvers', ()) + tuple(resolvers)
2962 10 2392725.0 239272.5 90.9 return _eval(expr, inplace=inplace, **kwargs)
and _get_index_resolvers()
can be drilled down to Index._to_embed
: 和
_get_index_resolvers()
可以深入到Index._to_embed
:
Line # Hits Time Per Hit % Time Line Contents
==============================================================
1253 def _to_embed(self, keep_tz=False, dtype=None):
1254 """
1255 *this is an internal non-public method*
1256
1257 return an array repr of this object, potentially casting to object
1258
1259 """
1260 40 73.0 1.8 0.0 if dtype is not None:
1261 return self.astype(dtype)._to_embed(keep_tz=keep_tz)
1262
1263 40 201490.0 5037.2 100.0 return self.values.copy()
Where the O(n)
-copying happens. O(n)
- 复制发生的地方。
声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.