Searching a large DataFrame with a MultiIndex slow

I have a large Pandas DataFrame (~800M rows), which I have indexed on a MultiIndex with two levels, an int and a date. I want to retrieve a subset of the DataFrame's rows based on a list of ints (about 10k) that I have. The ints match the first level of the MultiIndex, and the MultiIndex itself is unique.

The first thing I tried was to sort the index and then query it using loc:

df = get_my_df()  # 800M rows
ids = [...]       # 10k ints, sorted list

df.set_index(["int_idx", "date_idx"], inplace=True, drop=False)
df.sort_index(inplace=True)

idx = pd.IndexSlice
res = df.loc[idx[ids, :]]

However, this was painfully slow, and I stopped running the code after about an hour.

The next thing I tried was to set only the first column as the index. This is suboptimal for me because the index is not unique, and also later I'll need to further filter by date:

df.set_index("int_idx", inplace=True, drop=False)
df.sort_index(inplace=True)

idx = pd.IndexSlice
res = df.loc[idx[ids, :]]

To my surprise this was an improvement, but still very slow.

I have two questions:

  1. How can I make my query faster? (Either using a single index or a multi-index.)
  2. Why is a sorted multi-index still so slow?

It can be difficult to retrieve a subset of a DataFrame containing 800M rows. Here are some ideas to help your search run more quickly:

  1. Use .loc[] with boolean indexing instead of pd.IndexSlice:

Slice your MultiIndex with boolean indexing through .loc[] rather than pd.IndexSlice. This can help Pandas avoid the costly step of building a new index object for each slice when working with huge DataFrames.

For example:

res = df.loc[df.index.get_level_values('int_idx').isin(ids)]

  2. Avoid setting the index multiple times:

Setting the index and sorting the data multiple times can be costly. If you can, set the index just once, and avoid re-sorting it.

For example:

df.set_index(["int_idx", "date_idx"], inplace=True, drop=False)
res = df[df.index.get_level_values('int_idx').isin(ids)]

  3. Use chunking or parallel processing:

If your DataFrame is too big to process in memory, consider dividing it into smaller parts, processing each part separately, and then concatenating the results. To speed the query up further, you can also process those parts in parallel. Both of these tactics work well with the Dask library.
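For example, a minimal sketch of the chunked approach in plain pandas, assuming the df and ids from the question (the chunk size is an arbitrary choice; a Dask version would follow the same filter-then-concatenate pattern):

import pandas as pd

id_set = set(ids)          # set membership tests are O(1)
chunk_size = 10_000_000    # arbitrary; pick what fits in memory

# Filter each slice of rows independently, then stitch the matches together.
parts = []
for start in range(0, len(df), chunk_size):
    chunk = df.iloc[start:start + chunk_size]
    mask = chunk.index.get_level_values("int_idx").isin(id_set)
    parts.append(chunk[mask])

res = pd.concat(parts)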

As for your second question: a sorted multi-index ought to be quicker than an unsorted one, because it lets Pandas use the fast search routines built into NumPy. However, if a huge DataFrame has many columns or a complicated sort order, the sort itself can be expensive. Generally speaking, sorting a DataFrame is a costly operation that should be avoided wherever possible.
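For example, is_monotonic_increasing is a standard pandas Index attribute, so you can check whether the index is already sorted and pay the sorting cost only when it is actually needed:

if not df.index.is_monotonic_increasing:  # slicing an unsorted MultiIndex falls back to a slow path
    df.sort_index(inplace=True)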

MultiIndices are a wonderful convenience but are, in my experience, very slow. That's on top of the huge overhead that pandas already adds over numpy for even single-depth row and column labeling.

If your index/columns are fairly stable and everything else can be done in numpy, you will see huge speed improvements by managing your indices separately and converting to numpy using .to_numpy(). Depending on the code, I've seen improvements of over 100x. First convert your index to a dict of index:iloc, and then use that to do an integer-based row lookup.

index_dict = {idx:i for i,idx in enumerate(df.index.tolist())}  # map each (int_idx, date_idx) tuple to its row position
n_df = df.to_numpy()
row_ilocs = [index_dict[x] for x in ids]  # 0-based row locations in n_df; note this assumes ids are full index tuples
res = n_df[row_ilocs, :]

If you need to do your row lookups based on the first index level only, the index is just a list of tuples, so it's easy to write a list comprehension for that outside of pandas.
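For instance, a minimal sketch of that first-level-only lookup, reusing n_df from the code above (iterating a MultiIndex yields plain tuples):

id_set = set(ids)  # the 10k ints from the question

row_ilocs = [i for i, (int_idx, date_idx) in enumerate(df.index) if int_idx in id_set]
res = n_df[row_ilocs, :]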

If you'd rather not move into numpy, you'll still get big improvements (on a MultiIndex perhaps even 10x) by using .iloc over .loc. For example:

index_dict = {idx:i for i,idx in enumerate(df.index.tolist())}
row_ilocs = [index_dict[x] for x in ids]  # get list of 0-based locations in df
res = df.iloc[row_ilocs]

Preferably you would build index_dict only once and keep it around, or better yet create it alongside your initial df generation.
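For example, a sketch of doing that once up front, reusing the get_my_df() from the question:

df = get_my_df()
df.set_index(["int_idx", "date_idx"], inplace=True, drop=False)
index_dict = {idx: i for i, idx in enumerate(df.index)}  # built once, reused for every query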
