
Improve speed of pandas boolean indexing

Boolean indexing works fine on a sample of the data, but as I increase the size of the data, the computation time grows dramatically (example below). Does anyone know a way to speed up this particular boolean indexer?

import pandas as pd
import numpy as np
a = pd.date_range('2019-01-01', '2019-12-31',freq = '1T')
b = np.random.normal(size = len(a), loc = 50)
c = pd.DataFrame(index = a, data = b, columns = ['price'])

1500 rows:

z = c.head(1500)
z[z.index.map(lambda x : 8 <= x.hour <= 16 ) & z.index.map(lambda x : x.weekday() < 5 )]

CPU times: user 149 ms, sys: 8.71 ms, total: 158 ms Wall time: 157 ms

5000 rows:

z = c.head(5000)
z[z.index.map(lambda x : 8 <= x.hour <= 16 ) & z.index.map(lambda x : x.weekday() < 5 )]

CPU times: user 14.1 s, sys: 9.07 s, total: 23.2 s Wall time: 23.2 s

I tried z = c.head(10000), but it was taking more than 15 minutes to compute, so I stopped it. The data I want to use this indexer on is about 30,000 rows.

Both z.index.map(lambda x: 8 <= x.hour <= 16) and z.index.map(lambda x: x.weekday() < 5) execute almost instantly. The problem occurs when you combine them with the bitwise AND operator, &.

pd.Index.map returns another pd.Index object, and the & operator on Index objects actually performs set intersection; it is not an element-wise AND. If you look at the result you will see it is not what you expect: it is 5,000 Trues. The reason it takes so long is that these comparisons return boolean values, which are of course full of duplicates, and index intersection degrades badly in that situation.

The proper way to handle this is of course to use vectorized operations, but if you do need to compare two pd.Index objects element-wise, you can do so by converting them to NumPy arrays:

res1 = z.index.map(lambda x : 8 <= x.hour <= 16 ).to_numpy()
res2 = z.index.map(lambda x : x.weekday() < 5 ).to_numpy()
z[res1 & res2]
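To make the distinction concrete, here is a small self-contained sketch (the frame below is illustrative, built the same way as in the question) showing that Index.map yields a pd.Index, and that converting it with .to_numpy() gives a plain boolean array on which & is element-wise again:

```python
import pandas as pd
import numpy as np

# Illustrative frame mirroring the question's setup: 1000 minutes of prices
idx = pd.date_range('2019-01-01', periods=1000, freq='min')
z = pd.DataFrame({'price': np.random.normal(size=len(idx), loc=50)}, index=idx)

# Index.map returns a pd.Index, not a plain boolean array
mask = z.index.map(lambda x: 8 <= x.hour <= 16)
print(type(mask))

# Converting to a NumPy array makes & behave element-wise
arr = mask.to_numpy()
filtered = z[arr & z.index.map(lambda x: x.weekday() < 5).to_numpy()]
```

All 1000 minutes fall on a weekday here, so `filtered` keeps exactly the rows with hour between 8 and 16.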

The reason this is slow is that you perform the mapping with a lambda expression, which means a Python function call is made for each item. That is typically not a good idea when you want to process data in bulk. You can speed it up with:

hour = z.index.hour
z[(8 <= hour) & (hour <= 16) & (z.index.weekday < 5)]

With z = c (so a total of 524,161 rows), we get the following timings:

>>> from timeit import timeit
>>> z = c
>>> timeit(lambda: z[(8 <= z.index.hour) & (z.index.hour <= 16) & (z.index.weekday < 5)], number=100)
11.825318349001464

So this runs in roughly 118 milliseconds per run.

When we use the first 5,000 rows, we get:

>>> z = c.head(5000)
>>> timeit(lambda: z[(8 <= z.index.hour) & (z.index.hour <= 16) & (z.index.weekday < 5)], number=100)
0.1542488380218856

So this runs in roughly 1.5 milliseconds per run.
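As a side note, not part of the original answers: since the window here is "hours 8 through 16 on weekdays", pandas' time-aware selection can express the same filter. This is a sketch assuming the question's setup; between_time is inclusive on both ends by default, so '08:00' to '16:59' matches 8 <= hour <= 16 at minute resolution:

```python
import pandas as pd
import numpy as np

# Same setup as the question: one row per minute for 2019
idx = pd.date_range('2019-01-01', '2019-12-31', freq='min')
c = pd.DataFrame({'price': np.random.normal(size=len(idx), loc=50)}, index=idx)

# Weekday filter first, then the time-of-day window
out = c[c.index.weekday < 5].between_time('08:00', '16:59')
```

This should select the same rows as the vectorized hour/weekday comparison above, and stays entirely in vectorized pandas operations.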
