Why is the pandas.DataFrame.cov() method orders of magnitude faster than numpy.dot(…) for the same data?

I was timing covariance calculations in IPython like this:

>>> from pandas import DataFrame
>>> import numpy as np
>>> # create data frame set
>>> df = get_data()
>>> df.shape
(4795, 1000)
>>> %timeit df.cov()
10 loops, best of 3: 99.5 ms per loop
>>> mat = np.matrix(df.values)
>>> %timeit np.dot(mat.transpose(), mat)
1 loops, best of 3: 1min per loop

So, I've figured out the why of the observed speed difference, but not the why of the why. I'll update when I find that.

This is the answer to: "Why is the DataFrame.cov method so much faster than converting to a numpy matrix and using the np.cov or np.dot method?"

The DataFrame's data type was int64. It was converted to a numpy matrix using

mat = np.matrix(df.values)

The resulting 'mat' object is also of dtype int64.
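The dtype carry-over is easy to check. The snippet below is a minimal sketch, assuming a small random integer frame as a hypothetical stand-in for whatever get_data() returned in the question:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for get_data(): a small all-integer DataFrame.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(0, 100, size=(200, 20), dtype=np.int64))

arr = df.values                      # keeps the DataFrame's int64 dtype
arr_float = arr.astype(np.float64)   # explicit cast to floating point

print(arr.dtype, arr_float.dtype)
```

Checking `arr.dtype` before timing anything is a cheap way to rule this cause in or out.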

Under the hood, the DataFrame.cov method converts its matrix to float64 before calling numpy's covariance method.
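For reference, np.dot(mat.transpose(), mat) on its own is not a covariance matrix: the columns have to be centered first and the product divided by n - 1. The sketch below, again on hypothetical random data, casts to float64 and compares a manual computation with np.cov and DataFrame.cov. It illustrates the arithmetic only and is not a description of pandas internals.

```python
import numpy as np
import pandas as pd

# Hypothetical integer data standing in for the question's DataFrame.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(0, 100, size=(200, 20), dtype=np.int64))

# Cast once to float64, then center each column.
x = df.values.astype(np.float64)
centered = x - x.mean(axis=0)

# Sample covariance: X_c^T X_c / (n - 1), with variables in columns.
cov_manual = centered.T @ centered / (x.shape[0] - 1)

cov_numpy = np.cov(x, rowvar=False)   # rowvar=False: columns are variables
cov_pandas = df.cov().values

print(np.allclose(cov_manual, cov_numpy), np.allclose(cov_manual, cov_pandas))
```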

When you run timeit on numpy ndarrays or matrices of dtype int64, you see the same performance lag. On my machine, with a dataset of shape (16497, 5000), int64 operations do not complete and sometimes crash with memory errors, while float64 completes in seconds.
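The gap can be reproduced on a much smaller array. This is a rough sketch with made-up sizes and data, not the original benchmark, and the absolute timings will vary by machine and NumPy build:

```python
import timeit
import numpy as np

# Hypothetical data: small enough to finish quickly with either dtype.
rng = np.random.default_rng(0)
a_int = rng.integers(0, 100, size=(300, 200), dtype=np.int64)
a_float = a_int.astype(np.float64)

# Time the same product for both dtypes.
t_int = timeit.timeit(lambda: a_int.T @ a_int, number=3)
t_float = timeit.timeit(lambda: a_float.T @ a_float, number=3)

print(f"int64:   {t_int:.4f} s")
print(f"float64: {t_float:.4f} s")

# The values here are small enough that float64 represents them exactly,
# so both products should agree.
same = np.array_equal(a_int.T @ a_int, (a_float.T @ a_float).astype(np.int64))
print(same)
```

A likely contributing factor, which this answer does not confirm, is that NumPy hands floating-point matrix products to an optimized BLAS routine, whereas integer products go through a generic, unoptimized loop.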

So, the short answer is that the numpy.dot method above is slower than DataFrame.cov because the data types were different.

I'm going to investigate why this gap exists.
