Fastest way to query a dataframe

Question

I want to run aggregation operations (sums) on the rows of a large pandas dataframe (millions of rows), where the rows are selected by a condition on several fixed columns (at most 10). These columns contain only integer values.

My problem is that I have to run this operation (query + aggregate) a very large number of times (~100,000). I think there is not much to optimize in the aggregation part, since it is just a simple sum. What would be the most efficient way to perform this task? Is there some way I could build an 'index' on my condition columns to speed up each query?

Answer 1

I would try something along these lines:

Suppose you have the following dataframe:

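import numpy as np
import pandas as pd

# Two integer condition columns (A, B) and two numeric columns to sum.
N = 10000000
df = pd.DataFrame({
    'A': np.random.binomial(1, 0.5, N),
    'B': np.random.binomial(2, 0.5, N),
    'nume1': np.random.uniform(0, 1, N),
    'nume2': np.random.normal(0, 1, N)})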

then doing this

tmp = df[['A','B','nume1','nume2']].query('A > 0.5').groupby('B').sum().reset_index()[['B','nume1','nume2']]

is the equivalent of this SQL:

select B, sum(nume1),sum(nume2)
from df
where A > 0.5
group by B

This takes a little less than a second (926 ms, measured with %timeit) on my moderate machine (quad-core i7, 16 GB RAM).

I hope this helps.
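Since the same query has to run about 100,000 times, one option is to aggregate once over every combination of the condition columns and then run each query against that much smaller table. The following is only a sketch under assumptions not stated in the question: the aggregate is a plain sum (so partial sums can be added up again), and the integer condition columns take few distinct combinations compared with the number of rows. The names `cube`, `fast` and `slow` are made up for the illustration, and the frame is a smaller version of the one above.

```python
import numpy as np
import pandas as pd

# Small synthetic frame with the same layout as the example above.
rng = np.random.default_rng(0)
N = 100_000
df = pd.DataFrame({
    'A': rng.binomial(1, 0.5, N),
    'B': rng.binomial(2, 0.5, N),
    'nume1': rng.uniform(0, 1, N),
    'nume2': rng.normal(0, 1, N),
})

# Aggregate once over every combination of the integer condition columns.
cube = df.groupby(['A', 'B'], as_index=False)[['nume1', 'nume2']].sum()

# Each later query filters the small table instead of the full frame.
fast = cube.query('A > 0.5').groupby('B')[['nume1', 'nume2']].sum()

# Same query against the full frame, for comparison.
slow = df.query('A > 0.5').groupby('B')[['nume1', 'nume2']].sum()
```

`fast` and `slow` should agree up to floating-point rounding, but `fast` only touches one row per distinct (A, B) pair. If the condition columns have nearly as many distinct combinations as there are rows, the pre-aggregated table is no smaller than the original and this buys nothing.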

Answer 2

Without more details it's hard to answer your question.

You should indeed build an index on your condition columns.

df['idx'] = (df['col1'] * df['col2']) ** (df['col3'] + df['col4']) * df['col5'] == 0.012
df = df.set_index('idx')

Rewriting your condition as an indexable column may be hard. Keep in mind that you can set all the columns as the index:

df = df.set_index(['col1', 'col2', 'col3', 'col4', 'col5' ...])

This documentation on advanced indexing in pandas may help you think about your problem: http://pandas.pydata.org/pandas-docs/stable/indexing.html#multiindex-query-syntax
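As a rough illustration of the multi-column index idea, the sketch below sets two condition columns as a MultiIndex, sorts it once, and then answers queries with `.loc`. The column names `col1`, `col2` and `value`, the value ranges, and the two example conditions are assumptions chosen for the demo, not taken from the question.

```python
import numpy as np
import pandas as pd

# Hypothetical data: two integer condition columns and one column to sum.
rng = np.random.default_rng(0)
N = 100_000
df = pd.DataFrame({
    'col1': rng.integers(0, 5, N),
    'col2': rng.integers(0, 5, N),
    'value': rng.uniform(0, 1, N),
})

# Build the index once; a sorted MultiIndex supports label slicing.
indexed = df.set_index(['col1', 'col2']).sort_index()

# Equality on both columns: rows where col1 == 2 and col2 == 3.
total = indexed.loc[(2, 3), 'value'].sum()

# Range on the first level: col1 from 1 to 2 (label slices are inclusive).
range_total = indexed.loc[pd.IndexSlice[1:2, :], 'value'].sum()

# Reference results computed with plain boolean masks on the original frame.
mask_total = df.loc[(df['col1'] == 2) & (df['col2'] == 3), 'value'].sum()
mask_range_total = df.loc[df['col1'].between(1, 2), 'value'].sum()
```

The sort is paid once up front, so it only makes sense when many queries follow. Whether it beats boolean masks depends on the data and on how selective the conditions are, so it is worth timing both on the real frame.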
