
Looking for the fastest way to slice a row in a huge Pandas Dataframe

My program needs to fetch a row based on the value in a column from a huge Pandas Dataframe. The response time is critical. I use the most common way to do it, for example:

df.loc[df['id'] == 500000, :]

Per timeit on my Mac, it took 4 ms to complete the above operation on a dataframe with 1 million rows. But my goal is to reduce that time to something like 0.4 ms. I once considered converting this dataframe to a set, but a set is not ordered and does not natively support indexing or slicing. Any suggestions?

Let's set this up:

import timeit

import pandas as pd
import numpy as np

# 1 million rows; ids range 0-99, so each id matches many rows
df = pd.DataFrame({"id": np.random.randint(100, size=(1000000,))})

Then let's benchmark some options. Your current boolean mask + .loc:

>>> timeit.timeit("df.loc[df['id'] == 50, :]", setup = "from __main__ import df", number=1000)
2.566220869999597

The query engine:

>>> timeit.timeit("df.query('id == 50')", setup = "from __main__ import df", number=1000)
14.591400260000228

Using the index as a separate lookup:

>>> idx = pd.Index(df['id'])
>>> timeit.timeit("df.loc[idx == 50, :]", setup = "from __main__ import df, idx", number=1000)
2.2155187300013495

Using the dataframe index for lookup:

>>> df.index = df["id"]
>>> timeit.timeit("df.loc[50, :]", setup = "from __main__ import df", number=1000)
2.625610274999417

And that .isin() idea that someone in the comments had:

>>> timeit.timeit("df.loc[df['id'].isin([50]), :]", setup = "from __main__ import df", number=1000)
9.542700138999862

It looks like, with the exception of the query engine being slow (as expected) for a simple equality check, you're not going to do much better than the lookup time you already have.
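One further micro-optimization you could try (my own addition, not benchmarked in the original answer): build the boolean mask against the column's underlying NumPy array, which can skip some per-call pandas overhead. A minimal sketch, reusing the df from the setup above:

# Extract the raw NumPy array once, up front.
id_arr = df['id'].to_numpy()

# Masking the raw array avoids some Series overhead on each call;
# df.loc still performs the actual row selection on the frame.
rows = df.loc[id_arr == 50, :]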

Let's see how a unique ID might be helpful:

df_unique = pd.DataFrame({'id': range(1000000)})

>>> timeit.timeit("df_unique.loc[df_unique['id'] == 50, :]", setup = "from __main__ import df_unique", number=1000)
1.9672015519990964
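As a side note (my addition, not part of the original benchmarks): once the ids are unique, you can also make the column the actual index, so that .loc becomes a hash-based lookup instead of a full-column scan. A sketch:

# With a unique index, .loc uses pandas' hash table for the lookup.
df_indexed = df_unique.set_index('id')
row = df_indexed.loc[50]  # returns the matching row as a Series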

Then to a dict:

>>> df_unique.index = df_unique['id']
>>> df_dict = df_unique.to_dict(orient='index')
>>> timeit.timeit("df_dict[50]", setup = "from __main__ import df_dict", number=1000)
6.247700002859347e-05

Well, looks like this is a clear winner.

>>> timeit.timeit("pd.Series(df_dict[50])", setup = "from __main__ import df_dict, pd", number=1000)
0.2747819870000967

Even if you have to cast it back to a Series for something, this is an order of magnitude faster than before. (You could also map a Series back into the dict very easily if needed, and keep the speed of a dict lookup with no overhead.)
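A minimal sketch of that last idea (my own illustration, assuming you want ready-made Series values): pre-convert each row dict to a Series once at build time, so every lookup stays a plain dict hit:

# Pay the Series conversion cost once, up front...
series_dict = {key: pd.Series(row) for key, row in df_dict.items()}

# ...so each lookup is a pure dict hit that already returns a Series.
row = series_dict[50]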

Check how fast df.query('id == 500000') works.
