numpy.sum behaves differently on numpy.array and pandas.DataFrame

Question

In short, numpy.sum(a, axis=None) sums all cells of an array, but sums over the rows of a data frame, giving one total per column. I thought pandas.DataFrame was built on top of numpy.array, so it should not behave differently. What conversion happens under the hood?

import numpy
import pandas

a1 = numpy.random.random((3, 2))
a2 = pandas.DataFrame(a1)
numpy.sum(a1)  # Sums all cells
numpy.sum(a2)  # Sums over rows (one total per column)

Answer 1

OK, the following is a dump of my pdb debugging session, which shows how the call ends up in pandas land:

In [*]:

a1 = np.random.random((3,2))
import pdb
a2 = pd.DataFrame(a1)
print(np.sum(a1)) # Sums all cells
pdb.set_trace()
np.sum(a2) # Sums over rows
3.02993889742
--Return--
> <ipython-input-50-92405dd4ed52>(5)<module>()->None
-> pdb.set_trace()
(Pdb) b 6
Breakpoint 2 at <ipython-input-50-92405dd4ed52>:6
(Pdb) c
> <ipython-input-50-92405dd4ed52>(6)<module>()->None
-> np.sum(a2) # Sums over rows
(Pdb) s
--Call--
> c:\winpython-64bit-3.4.2.4\python-3.4.2.amd64\lib\site-packages\numpy\core\fromnumeric.py(1623)sum()
-> def sum(a, axis=None, dtype=None, out=None, keepdims=False):
(Pdb) print(axis)
None
(Pdb) s
> c:\winpython-64bit-3.4.2.4\python-3.4.2.amd64\lib\site-packages\numpy\core\fromnumeric.py(1700)sum()
-> if isinstance(a, _gentype):
(Pdb) s
> c:\winpython-64bit-3.4.2.4\python-3.4.2.amd64\lib\site-packages\numpy\core\fromnumeric.py(1706)sum()
-> elif type(a) is not mu.ndarray:
(Pdb) sssssss
*** NameError: name 'sssssss' is not defined
(Pdb) ss
*** NameError: name 'ss' is not defined
(Pdb) s
> c:\winpython-64bit-3.4.2.4\python-3.4.2.amd64\lib\site-packages\numpy\core\fromnumeric.py(1707)sum()
-> try:
(Pdb) s
> c:\winpython-64bit-3.4.2.4\python-3.4.2.amd64\lib\site-packages\numpy\core\fromnumeric.py(1708)sum()
-> sum = a.sum
(Pdb) s
> c:\winpython-64bit-3.4.2.4\python-3.4.2.amd64\lib\site-packages\numpy\core\fromnumeric.py(1713)sum()
-> return sum(axis=axis, dtype=dtype, out=out)
(Pdb) print(axis)
None
(Pdb) s
--Call--
> c:\winpython-64bit-3.4.2.4\python-3.4.2.amd64\lib\site-packages\pandas\core\generic.py(3973)stat_func()
-> @Substitution(outname=name, desc=desc)
(Pdb) s
> c:\winpython-64bit-3.4.2.4\python-3.4.2.amd64\lib\site-packages\pandas\core\generic.py(3977)stat_func()
-> if skipna is None:
(Pdb) s
> c:\winpython-64bit-3.4.2.4\python-3.4.2.amd64\lib\site-packages\pandas\core\generic.py(3978)stat_func()
-> skipna = True
(Pdb) s
> c:\winpython-64bit-3.4.2.4\python-3.4.2.amd64\lib\site-packages\pandas\core\generic.py(3979)stat_func()
-> if axis is None:
(Pdb) s
> c:\winpython-64bit-3.4.2.4\python-3.4.2.amd64\lib\site-packages\pandas\core\generic.py(3980)stat_func()
-> axis = self._stat_axis_number
(Pdb) print(self._stat_axis_number)
0
(Pdb)

As the trace shows, numpy.sum sees that a2 is not a plain ndarray and delegates to a2.sum(axis=axis, dtype=dtype, out=out), with axis still None. Once the call ends up in pandas land there are some checks on the arguments, one of which is that if axis is None it is assigned the value of self._stat_axis_number, which is 0, hence the difference in behaviour. I'm not a pandas dev, so they may be able to shed more light on this, but it explains the difference in output.
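The delegation step can be reproduced without pandas. The following is a minimal sketch, not pandas code: AxisRecorder is a hypothetical stand-in class that only records the axis it receives, and its fallback to axis 0 merely imitates what the trace above shows for the DataFrame.

```python
import numpy as np


class AxisRecorder:
    """Hypothetical stand-in for a DataFrame-like object (not a pandas class)."""

    def __init__(self):
        self.received_axis = "not called yet"

    def sum(self, axis=None, dtype=None, out=None, **kwargs):
        # Remember exactly what numpy.sum handed over.
        self.received_axis = axis
        # Imitate the defaulting seen in the trace: None falls back to axis 0.
        effective_axis = 0 if axis is None else axis
        return "reduced along axis {}".format(effective_axis)


recorder = AxisRecorder()
# The object is not an ndarray but has a sum method, so numpy.sum calls it.
result = np.sum(recorder)
print(recorder.received_axis)
print(result)
```

numpy.sum forwards axis=None unchanged, so what None means is decided entirely by the object's own sum method.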

To get the same output as for the array, you have to call sum twice:

In [6]:

a2.sum(axis=0).sum()
Out[6]:
3.9180334059883006

Or:

In [7]:

np.sum(np.sum(a2))
Out[7]:
3.9180334059883006
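Two ways of getting the grand total that do not depend on how axis=None is interpreted can be sketched as follows. This assumes a pandas version that provides DataFrame.to_numpy() (a2.values is the older spelling), and it uses fixed values rather than the random array above so the totals can be checked by hand.

```python
import numpy as np
import pandas as pd

# Fixed values instead of random ones: [[0, 1], [2, 3], [4, 5]]
a1 = np.arange(6.0).reshape(3, 2)
a2 = pd.DataFrame(a1)

# Option 1: drop down to the underlying ndarray, then sum every cell.
total_from_array = a2.to_numpy().sum()

# Option 2: name the axis explicitly, then sum the per-column totals.
column_totals = a2.sum(axis=0)
total_from_columns = column_totals.sum()

print(total_from_array, total_from_columns, np.sum(a1))
```

Passing the axis explicitly, or converting to an ndarray first, keeps the intent visible in the code instead of relying on a default.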

Answer 1: accepted 2015-03-01 20:46:36
