Is there a difference (in performance, for example) when comparing shape and len? Consider the following example:
In [1]: import numpy as np
In [2]: a = np.array([1,2,3,4])
In [3]: a.shape
Out[3]: (4,)
In [4]: len(a)
Out[4]: 4
Quick runtime comparison suggests that there's no difference:
In [17]: a = np.random.randint(0,10000, size=1000000)
In [18]: %time a.shape
CPU times: user 6 µs, sys: 2 µs, total: 8 µs
Wall time: 13.1 µs
Out[18]: (1000000,)
In [19]: %time len(a)
CPU times: user 5 µs, sys: 1 µs, total: 6 µs
Wall time: 9.06 µs
Out[19]: 1000000
So, what is the difference, and which one is more pythonic? (I guess using shape.)
I wouldn't worry about performance here; any difference should be marginal at most.
I'd say the more pythonic alternative is probably the one that matches your needs more closely:
a.shape may contain more information than len(a), since it holds the size along every axis, whereas len only returns the size along the first axis:
>>> a = np.array([[1,2,3,4], [1,2,3,4]])
>>> len(a)
2
>>> a.shape
(2L, 4L)
If you happen to work with one-dimensional arrays only, then I'd personally favour len(a) whenever you explicitly need the array's size.
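As a small sketch of the point above (the array names a0, a1 and a2 are made up for illustration), len(a) matches a.shape[0] for any array with at least one axis, while a zero-dimensional array has an empty shape and no len at all:

```python
import numpy as np

a1 = np.array([1, 2, 3, 4])                  # one-dimensional
a2 = np.array([[1, 2, 3, 4], [1, 2, 3, 4]])  # two-dimensional
a0 = np.array(5)                             # zero-dimensional

# len(a) is the size along the first axis, i.e. a.shape[0]
assert len(a1) == a1.shape[0] == 4
assert len(a2) == a2.shape[0] == 2

# shape also reports the remaining axes
second_axis = a2.shape[1]

# A 0-d array has shape (), and len() raises TypeError on it
zero_d_shape = a0.shape
try:
    len(a0)
    len_failed = False
except TypeError:
    len_failed = True
```

So for code that might receive arrays of unknown dimensionality, shape is the safer way to inspect them.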
From the source code, it looks like shape on a pandas DataFrame basically uses len(): https://github.com/pandas-dev/pandas/blob/master/pandas/core/frame.py
@property
def shape(self) -> Tuple[int, int]:
    return len(self.index), len(self.columns)

def __len__(self) -> int:
    return len(self.index)
Calling shape computes both dimensions every time. So df.shape[0] + df.shape[1] may be slower than len(df.index) + len(df.columns). Still, performance-wise, the difference should be negligible, except perhaps for a very large 2D dataframe.
So, in line with the previous answers, df.shape is good if you need both dimensions; for a single dimension, len() seems more appropriate conceptually.
Looking at answers on property versus method, it all comes down to usability and readability of the code. So again, in your case, I would say: if you want information about the whole dataframe, just to check it or, for example, to pass the shape tuple to a function, use shape. For a single column, including the index (i.e. the rows of a df), use len().
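To make the relationship concrete, here is a minimal sketch on a toy frame (the frame df and its columns x and y are hypothetical, not from the question); it only checks the equalities implied by the source snippet quoted above:

```python
import pandas as pd

# Hypothetical toy frame: 3 rows, 2 columns
df = pd.DataFrame({"x": [1, 2, 3], "y": [4.0, 5.0, 6.0]})

# shape gives both dimensions at once
n_rows, n_cols = df.shape
assert df.shape == (len(df.index), len(df.columns))

# len(df) is the number of rows only
assert len(df) == n_rows

# A single column (a Series) has the same length as the index
assert len(df["x"]) == n_rows
```

Unpacking df.shape once, as above, avoids evaluating the property twice when both dimensions are needed.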
There really is a (very small) difference. If you work on time-series data and know that the data is a vector (1D), use len, as it is faster, and make it a habit, even if the gain is very marginal. Bish's answer already explained what happens behind the scenes.
A proper benchmark using %%timeit (I ran it several times) shows len as the victor:
# tested on pandas DataFrame
%%timeit
len(yhat.values)
# 576 ns ± 1.4 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
%%timeit
yhat.values.shape[0]
# 607 ns ± 1.07 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
Furthermore, in 1D, len (as in length) is more informative when reading the code than .shape[0].
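Outside IPython, a similar comparison can be sketched with the standard timeit module. This is only a rough illustration: the array a and the call count n_calls are arbitrary choices, the lambda wrapper adds its own overhead to both measurements, and the numbers will vary by machine and library version.

```python
import timeit

import numpy as np

a = np.random.randint(0, 10000, size=1_000_000)

n_calls = 100_000  # arbitrary repetition count for this sketch

# Total wall time in seconds for n_calls invocations of each expression
t_len = timeit.timeit(lambda: len(a), number=n_calls)
t_shape = timeit.timeit(lambda: a.shape[0], number=n_calls)

print(f"len(a):     {t_len / n_calls * 1e9:.0f} ns per call")
print(f"a.shape[0]: {t_shape / n_calls * 1e9:.0f} ns per call")
```

Either way, both calls only read the array's metadata, so the cost should not depend on the number of elements.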
For the 1D case, both len and shape give the same information (shape just wraps it in a one-element tuple). In other cases, shape provides more information. Which one performs better may vary from program to program; I suggest not worrying much about performance.
import numpy as np
x = np.linspace(1, 10, 10).reshape((5, 2))
print(x)
print(x.size)
print(len(x))
gives the following output:
[[ 1.  2.]
 [ 3.  4.]
 [ 5.  6.]
 [ 7.  8.]
 [ 9. 10.]]
10
5
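As a short follow-up sketch on the same array, size counts all elements, so it equals the product of the entries of shape, while len only sees the first axis (the variable names n_elements, first_axis and dims are chosen here for illustration):

```python
import math

import numpy as np

x = np.linspace(1, 10, 10).reshape((5, 2))

n_elements = x.size   # total number of elements
first_axis = len(x)   # size along the first axis only
dims = x.shape        # size along every axis

# size is the product of all entries in shape
assert n_elements == math.prod(dims)
# len agrees with the first entry of shape
assert first_axis == dims[0]
```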