Why is len so much more efficient on a DataFrame than on the underlying numpy array?

Question

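I've noticed that using len on a DataFrame is far quicker than using len on the underlying numpy array. I don't understand why. Accessing the same information via shape doesn't help either. This matters more when I am trying to get both the number of columns and the number of rows. I have always gone back and forth on which method to use.

I put together the following experiment, and it is very clear that I will be using len on the DataFrame. But can someone explain why?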

from timeit import timeit
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt  # needed for plt.subplots below

ns = np.power(10, np.arange(6))
results = pd.DataFrame(
    columns=ns,
    index=pd.MultiIndex.from_product(
        [['len', 'len(values)', 'shape'],
         ns]))
dfs = {(n, m): pd.DataFrame(np.zeros((n, m))) for n in ns for m in ns}

for n, m in dfs.keys():
    df = dfs[(n, m)]
    results.loc[('len', n), m] = timeit('len(df)', 'from __main__ import df', number=10000)
    results.loc[('len(values)', n), m] = timeit('len(df.values)', 'from __main__ import df', number=10000)
    results.loc[('shape', n), m] = timeit('df.values.shape', 'from __main__ import df', number=10000)


fig, axes = plt.subplots(2, 3, figsize=(9, 6), sharex=True, sharey=True)
# DataFrame.iteritems() was removed in pandas 2.0; items() is the equivalent.
for i, (m, col) in enumerate(results.items()):
    r, c = i // 3, i % 3
    col.unstack(0).plot.bar(ax=axes[r, c], title=m)

Answer 1

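From looking at the various methods, the main reason is that constructing the numpy array df.values takes most of the time.

`len(df)` and `df.shape`

These two are fast because they are essentially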

len(df.index._data)

and

(len(df.index._data), len(df.columns._data))

where _data is a numpy.ndarray. Thus, df.shape should be about half as fast as len(df), since it finds the length of both df.index and df.columns (both of type pd.Index).
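As a quick check of the claim above, here is a minimal sketch that uses only the public `df.index` and `df.columns` attributes rather than the private `_data`. The frame size and variable names are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical example frame: 1000 rows, 10 columns.
df = pd.DataFrame(np.zeros((1000, 10)))

# len(df) reports the length of the row index ...
n_rows = len(df)
n_rows_from_index = len(df.index)

# ... and df.shape pairs the lengths of both axes.
shape_from_axes = (len(df.index), len(df.columns))

print(n_rows == n_rows_from_index)
print(df.shape == shape_from_axes)
```

Neither expression needs to touch the data itself, which fits the explanation that both are cheap.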

`len(df.values)` and `df.values.shape`

Suppose you had already extracted vals = df.values. Then

In [1]: df = pd.DataFrame(np.random.rand(1000, 10), columns=range(10))

In [2]: vals = df.values

In [3]: %timeit len(vals)
10000000 loops, best of 3: 35.4 ns per loop

In [4]: %timeit vals.shape
10000000 loops, best of 3: 51.7 ns per loop

Compared to:

In [5]: %timeit len(df.values)
100000 loops, best of 3: 3.55 µs per loop

So the bottleneck is not len but how df.values is constructed. If you examine the pandas.DataFrame.values property, you will find the (roughly equivalent) methods:

@property
def values(self):
    return self.as_matrix()

def as_matrix(self, columns=None):
    self._consolidate_inplace()
    if self._AXIS_REVERSED:
        return self._data.as_matrix(columns).T

    if len(self._data.blocks) == 0:
        return np.empty(self._data.shape, dtype=float)

    if columns is not None:
        mgr = self._data.reindex_axis(columns, axis=0)
    else:
        mgr = self._data

    if self._data._is_single_block or not self._data.is_mixed_type:
        return mgr.blocks[0].get_values()
    else:
        dtype = _interleaved_dtype(self.blocks)
        result = np.empty(self.shape, dtype=dtype)
        if result.shape[0] == 0:
            return result

        itemmask = np.zeros(self.shape[0])
        for blk in self.blocks:
            rl = blk.mgr_locs
            result[rl.indexer] = blk.get_values(dtype)
            itemmask[rl.indexer] = 1

        # vvv here is your final array assuming you actually have data
        return result 

def _consolidate_inplace(self):
    def f():
        if self._data.is_consolidated():
            return self._data

        bm = self._data.__class__(self._data.blocks, self._data.axes)
        bm._is_consolidated = False
        bm._consolidate_inplace()
        return bm
    self._protect_consolidate(f)

def _protect_consolidate(self, f):
    blocks_before = len(self._data.blocks)
    result = f()
    if len(self._data.blocks) != blocks_before:
        # Inlined cache reset: no specific item key is passed here,
        # so the whole item cache is cleared.
        self._item_cache.clear()
    return result

Note that df._data is a pandas.core.internals.BlockManager, not a numpy.ndarray.
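To make the interleaving branch above more concrete, here is a small sketch using only public API. It assumes a frame with one integer and one float column (the column names `a` and `b` are made up), so `.values` has to combine the columns into a single array with a common dtype. Internal details vary between pandas versions, so treat this as an illustration rather than a description of the exact code path.

```python
import numpy as np
import pandas as pd

# Hypothetical mixed-dtype frame: one int64 column and one float64 column.
mixed = pd.DataFrame({"a": [1, 2, 3], "b": [0.5, 1.5, 2.5]})

# Each column keeps its own dtype inside the frame.
print(mixed.dtypes)

# .values has to return one 2-D array, so a common dtype is chosen
# and the column data is combined into that array.
vals = mixed.values
print(vals.dtype)
print(vals.shape)
```

That combining step is work that `len(df)` never has to do.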

Answer 2

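If you look at __len__ for pd.DataFrame, it actually just calls len(df.index): https://github.com/pandas-dev/pandas/blob/master/pandas/core/frame.py#L770

For a RangeIndex, this is a really fast operation, since it is just a subtraction and a division of values stored in the index object: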

return max(0, -(-(self._stop - self._start) // self._step))

https://github.com/pandas-dev/pandas/blob/master/pandas/indexes/range.py#L458
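The expression above is a ceiling division written with floor division. The sketch below restates it as a standalone function (the name `range_len` is made up for illustration) and compares it against Python's built-in `range` for a few arbitrary start, stop and step values.

```python
def range_len(start, stop, step):
    """Length of range(start, stop, step) via ceiling division."""
    # -(-x // step) equals ceil(x / step); max(0, ...) handles empty ranges.
    return max(0, -(-(stop - start) // step))


# Compare against the built-in range for a handful of cases.
cases = [(0, 10, 1), (0, 10, 3), (5, 5, 1), (5, 0, 1), (10, 0, -2), (10, 1, -4)]
for start, stop, step in cases:
    print((start, stop, step), range_len(start, stop, step) == len(range(start, stop, step)))
```

No data is read at all, only three stored integers, which is why this is so cheap.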

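I suspect that if you tested with a non-RangeIndex, the difference in times would be much closer. I'll probably try modifying what you have to see if that's the case.

EDIT: After a quick check, the speed difference still seems to hold even with a standard Index, so there must be some other optimization at work.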
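One way to repeat that check is sketched below. It assumes a frame with a plain string index (the labels `row0`, `row1`, ... are made up) and reuses the `timeit` approach from the question. Timings depend on the machine and the pandas version, so no expected numbers are given.

```python
from timeit import timeit

import numpy as np
import pandas as pd

# Same kind of data as in the question, but with string labels
# instead of the default RangeIndex.
df = pd.DataFrame(np.zeros((1000, 10)),
                  index=[f"row{i}" for i in range(1000)])

# Pass the frame in through `globals` so the statement can see it.
t_len = timeit("len(df)", globals={"df": df}, number=10000)
t_len_values = timeit("len(df.values)", globals={"df": df}, number=10000)

print(f"len(df):        {t_len:.4f} s")
print(f"len(df.values): {t_len_values:.4f} s")
```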
