
How do I access a numpy array as quickly as a pandas dataframe

I ran a comparison of several ways to access data in a DataFrame. See the results below. The quickest access came from using the get_value method on a DataFrame. I was referred to it in this post.

What surprised me is that access via get_value is quicker than access via the underlying numpy object, df.values.

Question

Is there a way to access elements of a numpy array as quickly as I can access a pandas dataframe via get_value?

Setup

import pandas as pd
import numpy as np

df = pd.DataFrame(np.arange(16).reshape(4, 4))

Testing

%%timeit
df.iloc[2, 2]

10000 loops, best of 3: 108 µs per loop

%%timeit
df.values[2, 2]

The slowest run took 5.42 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 8.02 µs per loop

%%timeit
df.iat[2, 2]

The slowest run took 4.96 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 9.85 µs per loop

%%timeit
df.get_value(2, 2)

The slowest run took 19.29 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 3.57 µs per loop

iloc is pretty general: it accepts slices and lists as well as simple integers. In the case above, where you have simple integer indexing, pandas first determines that the index is a valid integer and then converts the request to an iat lookup, so it will clearly be much slower. iat eventually resolves to a call to get_value, so naturally a direct call to get_value is going to be fast. get_value itself is cached, so micro-benchmarks like these may not reflect performance in real code.
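As a quick sanity check on the paragraph above, the accessors can be compared on the same cell. This is a minimal sketch reusing the 4x4 frame from the question; the `results` dictionary is only an illustrative name. It also guards the get_value call with `hasattr`, because to my knowledge get_value was deprecated and later removed in pandas 1.0, so it may not exist in your installation.

```python
import numpy as np
import pandas as pd

# Same 4x4 frame as in the question's setup.
df = pd.DataFrame(np.arange(16).reshape(4, 4))

# Collect the value returned by each accessor for row 2, column 2.
results = {
    "iloc": df.iloc[2, 2],
    "iat": df.iat[2, 2],
    "values": df.values[2, 2],
}

# get_value may be missing on newer pandas, so only call it when present.
if hasattr(df, "get_value"):
    results["get_value"] = df.get_value(2, 2)

for name, value in results.items():
    print(name, value)
```

All accessors should return the same element; they differ only in how much work pandas does before reaching it.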

df.values does return an ndarray, but only after checking that the data is a single contiguous block. This requires a few lookups and tests, so it is a little slower than retrieving the value from the cache.
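If the cost lies in the df.values lookup rather than in the ndarray indexing itself, one option is to fetch the array once and index it repeatedly. The sketch below assumes uniformly typed data, as in the question; `arr` and `total` are illustrative names, not anything from the original post.

```python
import numpy as np
import pandas as pd

# Same 4x4 frame as in the question's setup.
df = pd.DataFrame(np.arange(16).reshape(4, 4))

# Pay the pandas attribute lookup and checks once, outside the hot loop.
arr = df.values

# Every access inside the loop is now plain ndarray indexing.
total = 0
for i in range(4):
    total += arr[i, i]  # walks the diagonal: 0, 5, 10, 15

print(total)
```

Timing `arr[2, 2]` on its own, instead of `df.values[2, 2]`, separates the cost of the pandas accessor from the cost of indexing the array.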

We can defeat the caching by creating a new data frame every time. This shows that the values accessor is fastest, at least for data of a uniform type:

In [111]: %timeit df = pd.DataFrame(np.arange(16).reshape(4, 4))
10000 loops, best of 3: 186 µs per loop

In [112]: %timeit df = pd.DataFrame(np.arange(16).reshape(4, 4)); df.values[2,2]
1000 loops, best of 3: 200 µs per loop

In [113]: %timeit df = pd.DataFrame(np.arange(16).reshape(4, 4)); df.get_value(2,2)
1000 loops, best of 3: 309 µs per loop

In [114]: %timeit df = pd.DataFrame(np.arange(16).reshape(4, 4)); df.iat[2,2]
1000 loops, best of 3: 308 µs per loop

In [115]: %timeit df = pd.DataFrame(np.arange(16).reshape(4, 4)); df.iloc[2,2]
1000 loops, best of 3: 420 µs per loop

In [116]: %timeit df = pd.DataFrame(np.arange(16).reshape(4, 4)); df.ix[2,2]
1000 loops, best of 3: 316 µs per loop
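The same build-then-access comparison can be reproduced outside IPython with the standard timeit module. This is only a sketch: the helper `best_per_loop`, the loop counts, and the idea of subtracting the construction time are my assumptions, and ix and get_value are left out because they may not exist in newer pandas versions. Absolute numbers will differ by machine and pandas version.

```python
import timeit

SETUP = "import numpy as np; import pandas as pd"
BUILD = "df = pd.DataFrame(np.arange(16).reshape(4, 4))"


def best_per_loop(stmt, number=200, repeat=3):
    """Return the best-of-`repeat` time per loop, in seconds."""
    runs = timeit.repeat(stmt, setup=SETUP, number=number, repeat=repeat)
    return min(runs) / number


# Cost of building the frame alone, used as a baseline.
baseline = best_per_loop(BUILD)

# Cost of building a fresh frame and then accessing one element.
accessors = ["df.values[2, 2]", "df.iat[2, 2]", "df.iloc[2, 2]"]
totals = {acc: best_per_loop(BUILD + "; " + acc) for acc in accessors}

for acc, total in totals.items():
    # The difference is a rough estimate and can be noisy, even negative.
    print(f"{acc:<16} total {total * 1e6:8.1f} us, "
          f"minus build {(total - baseline) * 1e6:8.1f} us")
```

Because construction dominates each measurement, small differences between accessors should be read with caution.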

The code claims that ix is the most general, so in theory it should be slower than iloc; it may be that your particular test favours ix, while other tests may favour iloc simply because of the order of the checks needed to identify the index as a scalar index.
