How do I access a numpy array as quickly as a pandas dataframe
I ran a comparison of several ways to access data in a DataFrame. See results below. The quickest access was from using the get_value method on a DataFrame. I was referred to this on this post.
What surprised me is that access via get_value is quicker than access via the underlying numpy object df.values.
My question is: is there a way to access elements of a numpy array as quickly as I can access a pandas dataframe via get_value?
import pandas as pd
import numpy as np
df = pd.DataFrame(np.arange(16).reshape(4, 4))
%%timeit
df.iloc[2, 2]
10000 loops, best of 3: 108 µs per loop
%%timeit
df.values[2, 2]
The slowest run took 5.42 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 8.02 µs per loop
%%timeit
df.iat[2, 2]
The slowest run took 4.96 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 9.85 µs per loop
%%timeit
df.get_value(2, 2)
The slowest run took 19.29 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 3.57 µs per loop
iloc is pretty general, accepting slices and lists as well as simple integers. In the case above, where you have simple integer indexing, pandas first determines that it is a valid integer, then converts the request to an iat index, so clearly it will be much slower. iat eventually resolves down to a call to get_value, so naturally a direct call to get_value is going to be fast. get_value itself is cached, so micro-benchmarks like these may not reflect performance in real code.
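To make the difference in generality concrete, here is a small sketch on the same 4x4 frame. It only uses iloc and iat, since get_value was removed in pandas 1.0; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(16).reshape(4, 4))

# iloc accepts scalars, slices and lists, so it has to inspect its arguments first.
scalar_via_iloc = df.iloc[2, 2]
block_via_iloc = df.iloc[1:3, [0, 2]]   # slice of rows, list of columns

# iat is restricted to a single integer position per axis.
scalar_via_iat = df.iat[2, 2]

print(scalar_via_iloc, scalar_via_iat)
print(block_via_iloc.shape)
```

Both scalar lookups return the same element; the extra work in iloc comes from deciding which kind of request it was given.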
df.values does return an ndarray, but only after checking that it is a single contiguous block. This requires a few lookups and tests, so it is a little slower than retrieving the value from the cache.
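One way to act on this, sketched under the assumption that the frame holds a single dtype: fetch the array once, outside any loop, and index the array itself afterwards. The names arr, value and total are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(16).reshape(4, 4))

# Pay the cost of the .values property once, not on every access.
arr = df.values

# From here on this is plain ndarray indexing; no pandas machinery is involved.
value = arr[2, 2]

# Example of repeated scalar access: sum the diagonal.
total = 0
for i in range(arr.shape[0]):
    total += arr[i, i]

print(value, total)
```

With the lookup of df.values hoisted out, each arr[i, j] costs only a numpy scalar index, which is the comparison the question is really after.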
We can defeat the caching by creating a new data frame every time. This shows that the values accessor is fastest, at least for data of a uniform type:
In [111]: %timeit df = pd.DataFrame(np.arange(16).reshape(4, 4))
10000 loops, best of 3: 186 µs per loop
In [112]: %timeit df = pd.DataFrame(np.arange(16).reshape(4, 4)); df.values[2,2]
1000 loops, best of 3: 200 µs per loop
In [113]: %timeit df = pd.DataFrame(np.arange(16).reshape(4, 4)); df.get_value(2,2)
1000 loops, best of 3: 309 µs per loop
In [114]: %timeit df = pd.DataFrame(np.arange(16).reshape(4, 4)); df.iat[2,2]
1000 loops, best of 3: 308 µs per loop
In [115]: %timeit df = pd.DataFrame(np.arange(16).reshape(4, 4)); df.iloc[2,2]
1000 loops, best of 3: 420 µs per loop
In [116]: %timeit df = pd.DataFrame(np.arange(16).reshape(4, 4)); df.ix[2,2]
1000 loops, best of 3: 316 µs per loop
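For readers who want to rerun this comparison outside IPython, a rough sketch using the standard timeit module follows. It assumes a recent pandas, where get_value and ix are no longer available (both were removed in pandas 1.0), so only values, iat and iloc are timed. The helper names make_frame and time_access are made up for this example, and absolute numbers will differ by machine and version.

```python
import timeit

import numpy as np
import pandas as pd


def make_frame():
    # Build a fresh frame each time so no per-frame cache can help.
    return pd.DataFrame(np.arange(16).reshape(4, 4))


def time_access(access, number=1000):
    """Return seconds per loop for: build a new frame, then do one scalar access."""
    total = timeit.timeit(lambda: access(make_frame()), number=number)
    return total / number


accessors = {
    "values": lambda df: df.values[2, 2],
    "iat": lambda df: df.iat[2, 2],
    "iloc": lambda df: df.iloc[2, 2],
}

results = {name: time_access(fn) for name, fn in accessors.items()}
for name, seconds in sorted(results.items(), key=lambda kv: kv[1]):
    print(f"{name:>6}: {seconds * 1e6:.1f} µs per loop")
```

As in the timings above, frame construction dominates each measurement, so only the differences between rows are meaningful.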
The code claims that ix is the most general, and so should in theory be slower than iloc; it may be that your particular test favours ix, while other tests may favour iloc, simply because of the order of the tests needed to identify the index as a scalar index.