简体   繁体   中英

shape vs len for numpy array

Is there a difference (in performance for example) when comparing shape and len ? Consider the following example:

In [1]: import numpy as np

In [2]: a = np.array([1,2,3,4])

In [3]: a.shape
Out[3]: (4,)

In [4]: len(a)
Out[4]: 4

Quick runtime comparison suggests that there's no difference:

In [17]: a = np.random.randint(0,10000, size=1000000)

In [18]: %time a.shape
CPU times: user 6 µs, sys: 2 µs, total: 8 µs
Wall time: 13.1 µs
Out[18]: (1000000,)

In [19]: %time len(a)
CPU times: user 5 µs, sys: 1 µs, total: 6 µs
Wall time: 9.06 µs
Out[19]: 1000000

So, what is the difference and which one is more pythonic? (I guess using shape ).

I wouldn't worry about performance here - any differences should only be very marginal.

I'd say the more pythonic alternative is probably the one which matches your needs more closely:

a.shape may contain more information than len(a) since it contains the size along all axes whereas len only returns the size along the first axis:

>>> a = np.array([[1,2,3,4], [1,2,3,4]])
>>> len(a)
2
>>> a.shape
(2L, 4L)

If you actually happen to work with one-dimensional arrays only, than I'd personally favour using len(a) in case you explicitly need the array's size.

From the source code, it looks like shape basically uses len() : https://github.com/pandas-dev/pandas/blob/master/pandas/core/frame.py

@property
def shape(self) -> Tuple[int, int]:
    return len(self.index), len(self.columns)
def __len__(self) -> int:
    return len(self.index)

Calling shape will attempt to run both dim calcs. So maybe df.shape[0] + df.shape[1] is slower than len(df.index) + len(df.columns) . Still, performance-wise, the difference should be negligible except for a giant giant 2D dataframe.

So in line with the previous answers, df.shape is good if you need both dimensions, for a single dimension, len() seems more appropriate conceptually.

Looking at property vs method answers, it all points to usability and readability of code. So again, in your case, I would say if you want information about the whole dataframe just to check or for example to pass the shape tuple to a function, use shape . For a single column, including index (ie the rows of a df), use len() .

There is really (very small) a different. If you work on time-series data and know that the data is vector (1D), use len as it is faster, and make it habit, even if it is just very-very marginal. Bish's answer already explained what happens behind the scene.

Proper benchmark using %%timeit (I test is several times) resulting in len as the victor:

# tested on pandas DataFrame

%%timeit
len(yhat.values)
# 576 ns ± 1.4 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

%%timeit
yhat.values.shape[0]
# 607 ns ± 1.07 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

Furthermore, in 1D, len as length is more informative (when you read a code) than .shape[0] .

For 1D case, both len and shape will produce same result. For other case, I shape will provide more information. It depends on program to program in which will provide you better performance. I suggest you to not to worry much about performance.

import numpy as np

x = np.linspace(1, 10, 10).reshape((5, 2))
print(x)
print(x.size)
print(len(x))

gives the following output:

[[ 1.  2.]
 [ 3.  4.]
 [ 5.  6.]
 [ 7.  8.]
 [ 9. 10.]]
10
5

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM