Why is groupby.diff so slow?

Question

I want to compute the diff of a series within each group, as in the following example:

In [24]: rnd_ser = pd.Series(np.random.randn(5000))
    ...: com_ser = pd.concat([rnd_ser] * 500, keys=np.arange(500), names=['Date', 'ID'])

In [25]: d1 = com_ser.groupby("Date").diff()

In [26]: d2 = com_ser - com_ser.groupby("Date").shift()

In [27]: np.allclose(d1.fillna(0), d2.fillna(0))
Out[27]: True
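
For readers who want to reproduce the comparison outside IPython, here is a minimal standalone sketch of the same check on a smaller series. The sizes, the fixed seed, and the use of `level="Date"` instead of the positional name are choices made for this illustration and are not from the original session. It also confirms that both approaches put NaN on the first row of every group, which the `fillna(0)` comparison above hides.

```python
import numpy as np
import pandas as pd

# Small, seeded stand-in for the series in the question (sizes are illustrative).
rnd_ser = pd.Series(np.random.RandomState(0).randn(50))
com_ser = pd.concat([rnd_ser] * 5, keys=np.arange(5), names=["Date", "ID"])

# Approach 1: grouped diff.
d1 = com_ser.groupby(level="Date").diff()

# Approach 2: subtract the grouped shift from the original series.
d2 = com_ser - com_ser.groupby(level="Date").shift()

# Both results are NaN exactly on the first row of each group.
first_rows = com_ser.groupby(level="Date").cumcount() == 0
assert (d1.isna() == first_rows).all()
assert (d2.isna() == first_rows).all()

# Values and index agree, NaN positions included.
pd.testing.assert_series_equal(d1, d2)
```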

There are two ways to solve this problem; however, the first one performs badly:

In [30]: %timeit d1 = com_ser.groupby("Date").diff()
616 ms ± 5.62 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [31]: %timeit d2 = com_ser - com_ser.groupby("Date").shift()
95 ms ± 326 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
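
The same measurement can be sketched with the standard `timeit` module, for anyone not running IPython. The helper names `make_series` and `best_time` are hypothetical, and the default sizes are deliberately smaller than in the question so the script finishes quickly. Absolute numbers will depend on the machine and the pandas version, so no expected output is shown.

```python
import timeit

import numpy as np
import pandas as pd


def make_series(n_groups=50, group_size=500, seed=0):
    """Build a two-level series shaped like the one in the question (hypothetical helper)."""
    base = pd.Series(np.random.RandomState(seed).randn(group_size))
    return pd.concat([base] * n_groups, keys=np.arange(n_groups), names=["Date", "ID"])


def best_time(fn, repeat=3):
    """Return the best wall-clock time in seconds over `repeat` single runs."""
    return min(timeit.repeat(fn, repeat=repeat, number=1))


com_ser = make_series()

t_diff = best_time(lambda: com_ser.groupby(level="Date").diff())
t_shift = best_time(lambda: com_ser - com_ser.groupby(level="Date").shift())

print("groupby.diff:            {:.1f} ms".format(t_diff * 1e3))
print("series - groupby.shift:  {:.1f} ms".format(t_shift * 1e3))
```

Raising `n_groups` and `group_size` toward the values in the question (500 and 5000) should bring the numbers closer to the ones reported above.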

Is this expected or a bug?

The details of my environment:

In [23]: pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.7.1.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 158 Stepping 10, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.23.4
pytest: 3.9.3
pip: 18.1
setuptools: 40.5.0
Cython: 0.29
numpy: 1.15.3
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 7.1.1
sphinx: 1.8.1
patsy: 0.5.1
dateutil: 2.7.5
pytz: 2018.7
blosc: None
bottleneck: 1.2.1
tables: 3.4.4
numexpr: 2.6.8
feather: None
matplotlib: 3.0.1
openpyxl: 2.5.9
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.1.2
lxml: 4.2.5
bs4: 4.6.3
html5lib: 1.0.1
sqlalchemy: 1.2.12
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

Answer 1

FWIW, I am seeing similar numbers on my machine:

%timeit d1 = com_ser.groupby("Date").diff()
523 ms ± 32.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit d2 = com_ser - com_ser.groupby("Date").shift()
80.8 ms ± 2.31 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

The pandas implementation of diff() seems to be slow when combined with groupby().

For example, if I make one big series:

big_ser = pd.Series(np.random.randn(int(1e7)))

and then compare shift-and-subtract against Series.diff():

%timeit big_ser - big_ser.shift()
46.3 ms ± 789 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit big_ser.diff()
41.6 ms ± 488 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Then the two implementations take roughly the same time. This makes sense: if you look at the internal source code behind Series.diff, the docstring says so explicitly:

def diff(arr, n, axis=0):
    """
    difference of n between self,
    analogous to s-s.shift(n)

So I think it has to be some overhead in groupby that is specific to diff().
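
One way to probe that guess is to hold the total number of rows fixed and vary only the number of groups. The sketch below does this; the helper name `time_both`, the row count, and the group counts are assumptions chosen for illustration. If the gap between the two approaches widens as the number of groups grows, that would be consistent with a per-group cost in groupby.diff, though it would not prove where the time is spent.

```python
import timeit

import numpy as np
import pandas as pd


def time_both(n_groups, total_rows=20000):
    """Time groupby.diff and shift-and-subtract on the same data (hypothetical helper)."""
    group_size = total_rows // n_groups
    ser = pd.Series(np.random.RandomState(0).randn(n_groups * group_size))
    # Contiguous integer group labels: 0,0,...,1,1,...
    keys = np.repeat(np.arange(n_groups), group_size)

    t_diff = min(timeit.repeat(lambda: ser.groupby(keys).diff(), repeat=3, number=1))
    t_shift = min(timeit.repeat(lambda: ser - ser.groupby(keys).shift(), repeat=3, number=1))
    return t_diff, t_shift


# Same total size each time; only the number of groups changes.
results = {n: time_both(n) for n in (10, 100, 1000)}

for n, (t_diff, t_shift) in results.items():
    print("{:>5} groups: diff {:.2f} ms, shift-and-subtract {:.2f} ms".format(
        n, t_diff * 1e3, t_shift * 1e3))
```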

Answer 1
0 2018-11-05 17:17:27
