
Fastest way to iterate and index through Pandas Dataframe

I have an array of 50k strings called `products` and a dataframe of about 22 million rows called `all`.

I want to iterate through the array and then select the corresponding subset of the dataframe that contains each array value:

for i in products:
    all.query('id == @i')  # @i makes query() see the Python variable i

Each query takes about 1.5 s to compute; with 50k values in my array, that will take about 20 hours.

Do you know any faster way to compute this?

If you want to select all rows whose ids are in the products list, this should be much faster than a for loop:

import numpy as np
df[np.isin(df.id, products)]  # np.isin supersedes the older np.in1d
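As a toy illustration of what this one-liner does (the data here is made up, not the asker's): `np.isin` builds a boolean mask over the whole column in a single vectorized pass, so all 50k ids are matched at once instead of one query per id.

```python
import numpy as np
import pandas as pd

# Tiny stand-in for the real 22M-row frame (names are illustrative).
df = pd.DataFrame({'id': ['a', 'b', 'c', 'b', 'd']})
products = ['b', 'd']

# One vectorized pass builds a boolean mask for every id at once.
subset = df[np.isin(df['id'], products)]
print(subset['id'].tolist())  # ['b', 'b', 'd']
```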

In order to test this, I generated my own version of these dataframes (not sure if the statistical properties are the same, but the timing results seem similar to what you're getting):

import pandas as pd
import numpy as np

import uuid

products = pd.Series([uuid.uuid4().hex for i in range(50000)])
all_products = pd.DataFrame(np.random.choice(products,
                                             size=(int(22e6),), replace=True),
                            columns=['id'])

Binary search method

One way to do this is to sort your all dataframe and use searchsorted to do the queries as binary searches. This has a one-time heavy cost of sorting the 22M rows (n log n), but makes each lookup much faster (log n). This may be the fastest way to achieve your explicitly stated goal:

import timeit

s = timeit.default_timer()
all_products_sorted = all_products.sort_values(by='id')
e = timeit.default_timer()
print('Time to sort: {:0.5f}'.format(e - s))
# Time to sort: 11.27207

N = 1000
s = timeit.default_timer()
for _, i in zip(range(N), products):
    start = all_products_sorted['id'].searchsorted(i, side='left')
    end = all_products_sorted['id'].searchsorted(i, side='right')
    # searchsorted returns a scalar position in recent pandas versions
    x = all_products_sorted['id'].iloc[start:end]
e = timeit.default_timer()

print('{:0.5f}s per query'.format((e - s) / N))
# 0.00038s per query

So it seems that you can expect to sort the rows in around 12 s, and then query the 50,000 ids in another ~20 s, for a total of about 32 s. In my example I don't actually save the results, but I assume once you have the indices into the all_products dataframe (don't call it all, because that shadows a Python builtin!), you can store them as desired.
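One way to store those results, sketched here on tiny illustrative data (the dict-of-subframes shape is my assumption, not part of the answer): once the frame is sorted by id, each id's rows form a contiguous block, so two binary searches bound the block and a plain `iloc` slice extracts it.

```python
import pandas as pd

# Already sorted by 'id', as the binary-search method requires.
all_products_sorted = pd.DataFrame({'id': ['a', 'a', 'b', 'c', 'c', 'c']})
ids = all_products_sorted['id']

results = {}
for i in ['a', 'c']:  # stand-in for the 50k-element products array
    start = ids.searchsorted(i, side='left')   # first row with this id
    end = ids.searchsorted(i, side='right')    # one past the last such row
    results[i] = all_products_sorted.iloc[start:end]

print(len(results['a']), len(results['c']))  # 2 3
```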

Groupby method

Another method, which (according to my test) is considerably faster if all_products consists entirely or mostly of values from products (as mine does), is to group all_products by id and dump the result into a dictionary (or whatever else you want to do with it):

s = timeit.default_timer()
x_dict = {k: v for k, v in all_products.groupby('id')}
e = timeit.default_timer()
print('{:0.5f}s per query'.format((e - s) / len(products)))
# 0.00032s per query

Note that in this case it is apparently faster than the searchsorted method (though not by much), and it doesn't require the input to be sorted in the first place.

Note that if what you actually want to do is transform these rows or modify them in some way, then groupby is definitely the way to go: don't even bother dumping to a dictionary; instead, see the split-apply-combine page for strategies on working with dataframes this way.
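A minimal split-apply-combine sketch (the column names and the aggregation are invented for illustration): rather than materializing a dictionary of subframes, a per-group computation can be broadcast straight back onto the rows with `transform`.

```python
import pandas as pd

# Hypothetical per-product data; 'qty' is an invented column.
df = pd.DataFrame({'id': ['a', 'b', 'a', 'b'], 'qty': [1, 2, 3, 4]})

# Total quantity per product, broadcast back onto every row of that group.
df['group_total'] = df.groupby('id')['qty'].transform('sum')
print(df['group_total'].tolist())  # [4, 6, 4, 6]
```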

Naive methods

For comparison, here are two approaches that involve full searches:

import timeit
N = 5
s = timeit.default_timer()
for _, i in zip(range(N), products):
    x = all_products.query('id == "{}"'.format(i))
e = timeit.default_timer()

print('{:0.5f}s per query'.format((e - s) / N))  # 1.60075s per query


s = timeit.default_timer()
for _, i in zip(range(N), products):
    x = all_products[all_products['id'] == i]
e = timeit.default_timer()

print('{:0.5f}s per query'.format((e - s) / N))  # 3.00135s per query
