Why are simple operations on pandas.DataFrames so slow compared to the same operations on numpy.ndarrays?

Question

Why are operations on pandas.DataFrames so slow? Look at the following examples.

Measurement:

Create a numpy.ndarray populated with random floating point numbers
Create a pandas.DataFrame populated with the same numpy array

Then I measure the time of the following operations:

For the numpy.ndarray
- Take the sum along the 0-axis
- Take the sum along the 1-axis
For the pandas.DataFrame
- Take the sum along the 0-axis
- Take the sum along the 1-axis
For the pandas.DataFrame.values -> np.ndarray
- Take the sum along the 0-axis
- Take the sum along the 1-axis
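Outside of IPython, where the %%timeit magic is not available, the same measurement plan can be sketched with the standard timeit module. This is only a rough sketch: the array shape is deliberately much smaller than in the question so that it runs quickly, and the names candidates and timings are made up for illustration.

```python
import timeit

import numpy as np
import pandas as pd

# Assumption: a much smaller shape than in the question, to keep the run short.
n, m = 2000, 200
array = np.random.uniform(0, 1, (n, m))
dataframe = pd.DataFrame(array)

# The three variants listed above, each parameterized by the axis to sum over.
candidates = {
    "ndarray": lambda axis: array.sum(axis=axis),
    "DataFrame": lambda axis: dataframe.sum(axis=axis),
    "DataFrame.values": lambda axis: dataframe.values.sum(axis=axis),
}

timings = {}
for name, func in candidates.items():
    for axis in (0, 1):
        # Best of 3 repeats with 10 calls each, converted to seconds per call.
        best = min(timeit.repeat(lambda: func(axis), number=10, repeat=3)) / 10
        timings[(name, axis)] = best
        print(f"{name:>16} axis={axis}: {best * 1e3:.3f} ms")
```

The absolute numbers depend on the machine, the library versions, and the array size, so they will not match the timings reported in the question.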

Observations

Summing over numpy.ndarrays is much faster than operating on pandas.DataFrames.
This is true even if the pd.DataFrame contains only floating point numbers and has nothing special attached (MultiIndex or whatever).
Operations on numpy.ndarray are about 7 to 10 times faster.

Questions

Why does this happen?
How can this be optimized?
Is pandas not able to call or pass through numpy's operations?

import numpy as np
import pandas as pd

n = 50000
m = 5000
array = np.random.uniform(0, 1, (n, m))
dataframe = pd.DataFrame(array)

Numpy

%%timeit
array.sum(axis=0)

206 ms ± 3.78 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit
array.sum(axis=1)

233 ms ± 33.9 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Pandas

%%timeit
dataframe.sum(axis=0)

1.65 s ± 14.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit
dataframe.sum(axis=1)

1.74 s ± 15.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Pandas without Pandas

Let's operate on the values alone...

%%timeit
dataframe.values.sum(axis=0)

206 ms ± 7.13 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit
dataframe.values.sum(axis=1)

181 ms ± 1.66 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Answer 1

Pandas uses numpy arrays as its underlying data containers, but provides many more features. A DataFrame contains a collection of 1D numpy arrays of possibly different dtypes, along with two Index objects (one for the rows and one for the columns). Those indexes can even be MultiIndex objects.

All this comes at a performance cost.
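As a small illustration of the extra structure a DataFrame carries, the sketch below builds a frame with columns of different dtypes and inspects its two Index objects. The column names a, b, c and the variable names are made up for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with an integer, a float, and a string column.
df = pd.DataFrame({
    "a": np.arange(3, dtype="int64"),
    "b": np.array([0.5, 1.5, 2.5]),
    "c": ["x", "y", "z"],
})

# Each column keeps its own dtype.
print(df.dtypes)

# One Index for the rows, one for the columns.
print(type(df.index), type(df.columns))

# A frame with a single float dtype converts to a plain float64 array.
float_only = pd.DataFrame(np.zeros((3, 2)))
print(float_only.to_numpy().dtype)

# A frame with mixed dtypes has to fall back to a generic object array.
mixed = df.to_numpy()
print(mixed.dtype)
```

A plain ndarray has exactly one dtype and no labels, so it never has to do this bookkeeping.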

The good news is that you can directly process the underlying numpy arrays at numpy level for additional performance if you do not need the fancy indexing of pandas.
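One way this could look in practice, as a minimal sketch: do the reduction on the underlying array and, only if the labels are still needed, wrap the result back into a Series. The smaller shape and the names values, col_sums and row_sums are assumptions for the example.

```python
import numpy as np
import pandas as pd

# Assumption: a small all-float frame standing in for the one in the question.
dataframe = pd.DataFrame(np.random.uniform(0, 1, (1000, 50)))

# Drop down to the underlying numpy array once.
values = dataframe.to_numpy()

# Sum at numpy level, then reattach the labels only where they are needed.
col_sums = pd.Series(values.sum(axis=0), index=dataframe.columns)
row_sums = pd.Series(values.sum(axis=1), index=dataframe.index)

print(col_sums.head())
print(row_sums.head())
```

Note that the two are not interchangeable when data is missing: DataFrame.sum skips NaN by default (skipna=True), while ndarray.sum propagates it, so np.nansum would be the closer match for data with missing values.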

Answer 1, dated 2020-05-27 13:33:14
