Why are simple operations on pandas.DataFrames so slow compared to the same operations on numpy.ndarrays?
Look at the following examples.
- a numpy.ndarray populated with random floating point numbers
- a pandas.DataFrame populated with the same numpy array
Then I measure the time of the following operations:
- For the numpy.ndarray: sum along axis 0 and along axis 1
- For the pandas.DataFrame: sum along axis 0 and along axis 1
- For the pandas.DataFrame.values -> np.ndarray: sum along axis 0 and along axis 1
As the timings below show, operating on numpy.ndarrays is much faster than operating on pandas.DataFrames. This is true even when the pd.DataFrame contains only floating point numbers and has nothing special attached (MultiIndex or whatever). Operations on the numpy.ndarray are about 7 to 10 times faster. Is pandas not able to call or pass through numpy's operations?

import numpy as np
import pandas as pd
n = 50000
m = 5000
array = np.random.uniform(0, 1, (n, m))
dataframe = pd.DataFrame(array)
%%timeit
array.sum(axis=0)
206 ms ± 3.78 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
array.sum(axis=1)
233 ms ± 33.9 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit
dataframe.sum(axis=0)
1.65 s ± 14.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
dataframe.sum(axis=1)
1.74 s ± 15.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Let's operate on the values alone...
%%timeit
dataframe.values.sum(axis=0)
206 ms ± 7.13 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
dataframe.values.sum(axis=1)
181 ms ± 1.66 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
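The timings above rely on IPython's %%timeit cell magic and an array of roughly 2 GB. For readers who want to repeat the comparison in a plain Python script, here is a minimal sketch using the standard timeit module. The smaller shape, the fixed seed, and the names candidates and timings are assumptions made for this illustration, not part of the original measurements, and the absolute numbers will differ from those quoted above.

```python
import timeit

import numpy as np
import pandas as pd

# Assumption: a much smaller shape than in the question, so the script runs quickly.
n, m = 2000, 500
rng = np.random.default_rng(0)
array = rng.uniform(0, 1, (n, m))
dataframe = pd.DataFrame(array)

# The same six operations that are timed in the question.
candidates = {
    "array.sum(axis=0)": lambda: array.sum(axis=0),
    "array.sum(axis=1)": lambda: array.sum(axis=1),
    "dataframe.sum(axis=0)": lambda: dataframe.sum(axis=0),
    "dataframe.sum(axis=1)": lambda: dataframe.sum(axis=1),
    "dataframe.values.sum(axis=0)": lambda: dataframe.values.sum(axis=0),
    "dataframe.values.sum(axis=1)": lambda: dataframe.values.sum(axis=1),
}

timings = {}
for label, func in candidates.items():
    # Best of 5 repeats, 10 calls per repeat, converted to seconds per call.
    timings[label] = min(timeit.repeat(func, number=10, repeat=5)) / 10
    print(f"{label:30s} {timings[label] * 1e3:8.3f} ms")
```

The ratio between the DataFrame and ndarray timings depends on the pandas and numpy versions and on the array shape, so it may not match the factor reported in the question.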
Pandas uses numpy for its underlying data containers, but provides many more features. A DataFrame contains a collection of 1D numpy arrays of possibly different dtypes, along with two Index objects (one for the rows and one for the columns). Those indexes can even be of MultiIndex type.
All this comes at a performance cost.
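To make that description concrete, here is a small sketch that inspects a DataFrame through its public attributes. The column names x and n and the index level names group and item are made up for this illustration, and the example only shows what the public API exposes, not how pandas stores the data internally.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: two columns with different dtypes and a MultiIndex on the rows.
row_index = pd.MultiIndex.from_product([["a", "b"], [1, 2]], names=["group", "item"])
df = pd.DataFrame(
    {
        "x": np.array([0.5, 1.5, 2.5, 3.5], dtype=np.float64),
        "n": np.array([1, 2, 3, 4], dtype=np.int64),
    },
    index=row_index,
)

# Each column keeps its own dtype.
print(df.dtypes)

# Two Index objects: one for the rows, one for the columns.
print(type(df.index).__name__, type(df.columns).__name__)

# Converting the whole frame to a single ndarray forces one common dtype.
as_array = df.to_numpy()
print(as_array.dtype)
```

Per-column dtypes and labelled axes are bookkeeping that a plain ndarray does not carry, which is part of the overhead described above.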
The good news is that you can directly process the underlying numpy arrays at the numpy level for additional performance if you do not need the fancy indexing of pandas.
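As a rough sketch of that approach, the example below pulls the values out of a DataFrame, computes the sums with numpy, and re-attaches the labels only at the end. The frame shape and the variable names are assumptions chosen for the illustration. It uses DataFrame.to_numpy(), which for an all-float frame like this one returns the same data as the .values attribute used in the question.

```python
import numpy as np
import pandas as pd

# Assumption: a small all-float frame standing in for the one in the question.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.uniform(0, 1, (1000, 50)))

# Drop to the numpy level once, then do the arithmetic there.
values = df.to_numpy()
col_sums = values.sum(axis=0)
row_sums = values.sum(axis=1)

# Re-attach the pandas labels only if they are needed downstream.
col_sums_labeled = pd.Series(col_sums, index=df.columns)
row_sums_labeled = pd.Series(row_sums, index=df.index)
```

One difference to keep in mind: DataFrame.sum skips NaN values by default, while ndarray.sum propagates them, so the two only agree on data without missing values (np.nansum is the numpy counterpart that ignores NaN).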