A faster alternative to Pandas `isin` function

I have a very large data frame `df` that looks like:

ID       Value1    Value2
1345      3.2      332
1355      2.2      32
2346      1.0      11
3456      8.9      322

And I have a list, `ID_list`, that contains a subset of the IDs. I need the subset of `df` whose `ID` is contained in `ID_list`.

Currently I am using `df_sub = df[df.ID.isin(ID_list)]` to do it, but it takes a lot of time. The IDs contained in `ID_list` don't follow any pattern, so they are not within a certain range. (And I need to apply the same operation to many similar data frames.) I was wondering if there is a faster way to do this. Would it help a lot to make `ID` the index?
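For reference, a minimal runnable version of this setup (the contents of `ID_list` are made up for illustration):

import pandas as pd

# Small stand-in for the real, much larger data frame shown above.
df = pd.DataFrame({
    'ID':     [1345, 1355, 2346, 3456],
    'Value1': [3.2, 2.2, 1.0, 8.9],
    'Value2': [332, 32, 11, 322],
})

ID_list = [1355, 3456]   # made-up subset of IDs

# Current approach: boolean mask built with isin.
df_sub = df[df.ID.isin(ID_list)]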

Thanks!

EDIT 2: Here's a link to a more recent look into the performance of various pandas operations, though it doesn't seem to include `merge` and `join` to date.

https://github.com/mm-mansour/Fast-Pandas

EDIT 1: These benchmarks were for a quite old version of pandas and are likely no longer relevant. See Mike's comment below on `merge`.
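For illustration, a `merge`-based version of the same filter would look something like this (a sketch using the question's `df` and `ID_list`; it assumes `ID_list` has no duplicates):

import pandas as pd

# Membership filter via an inner merge on the 'ID' column.
ids = pd.DataFrame({'ID': ID_list})
df_sub = df.merge(ids, on='ID', how='inner')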

It depends on the size of your data, but for large datasets `DataFrame.join` seems to be the way to go. This requires your DataFrame index to be your 'ID', and the Series or DataFrame you're joining against to have an index that is your 'ID_list'. The Series must also have a name to be used with `join`; it gets pulled in as a new field named after the Series. You also need to specify an inner join to get something like `isin`, because `join` defaults to a left join. `query` 'in' syntax seems to have the same speed characteristics as `isin` for large datasets.
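As a concrete illustration of that setup, a minimal sketch using the question's `df` and `ID_list` (the `lookup` Series and its name are placeholders):

import pandas as pd

# df must be indexed by 'ID' for join to match on it.
df = df.set_index('ID')

# Turn ID_list into a Series whose *index* carries the IDs and which has a
# name, so join can pull it in as a column.
lookup = pd.Series(1, index=ID_list, name='in_list')

# An inner join keeps only the rows whose ID appears in ID_list, which
# mimics the isin filter; the extra 'in_list' column can be dropped after.
df_sub = df.join(lookup, how='inner')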

If you're working with small datasets you get different behavior, and it actually becomes faster to use a list comprehension or `apply` against a dictionary than to use `isin`.

Otherwise, you can try to get more speed with Cython.

import pandas as pd

# I'm ignoring that the index is defaulting to a sequential number. You
# would need to explicitly assign your IDs to the index here, e.g.:
# >>> l_series.index = ID_list
l = range(1000000)
l_series = pd.Series(l)

df = pd.DataFrame(l_series, columns=['ID'])


In [247]: %timeit df[df.index.isin(l)]
1 loops, best of 3: 1.12 s per loop

In [248]: %timeit df[df.index.isin(l_series)]
1 loops, best of 3: 549 ms per loop

# index vs column doesn't make a difference here
In [304]: %timeit df[df.ID.isin(l_series)]
1 loops, best of 3: 541 ms per loop

In [305]: %timeit df[df.index.isin(l_series)]
1 loops, best of 3: 529 ms per loop

# query 'in' syntax has the same performance as 'isin'
In [249]: %timeit df.query('index in @l')
1 loops, best of 3: 1.14 s per loop

In [250]: %timeit df.query('index in @l_series')
1 loops, best of 3: 564 ms per loop

# ID must be the index for DataFrame.join and l_series must have a name.
# join defaults to a left join so we need to specify inner for existence.
In [251]: %timeit df.join(l_series, how='inner')
10 loops, best of 3: 93.3 ms per loop

# Smaller datasets.
df = pd.DataFrame([1,2,3,4], columns=['ID'])
l = range(10000)
l_dict = dict(zip(l, l))
l_series = pd.Series(l)
l_series.name = 'ID_list'


In [363]: %timeit df.join(l_series, how='inner')
1000 loops, best of 3: 733 µs per loop

In [291]: %timeit df[df.ID.isin(l_dict)]
1000 loops, best of 3: 742 µs per loop

In [292]: %timeit df[df.ID.isin(l)]
1000 loops, best of 3: 771 µs per loop

In [294]: %timeit df[df.ID.isin(l_series)]
100 loops, best of 3: 2 ms per loop

# It's actually faster to use apply or a list comprehension for these small cases.
In [296]: %timeit df[[x in l_dict for x in df.ID]]
1000 loops, best of 3: 203 µs per loop

In [299]: %timeit df[df.ID.apply(lambda x: x in l_dict)]
1000 loops, best of 3: 297 µs per loop

Yes, `isin` is quite slow.

Instead, it's faster to make `ID` the index and then use `loc`, like:

df.set_index('ID', inplace=True)
df.loc[list_of_indices]
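For example, with the question's `ID_list` (a sketch; note that recent pandas versions raise a `KeyError` if any label in the list is missing from the index, so intersecting first is one way to guard against that):

# Assuming df is already indexed by 'ID' as above.
df_sub = df.loc[ID_list]

# If some IDs in ID_list might be missing from the index, recent pandas
# raises a KeyError for the line above; intersecting first avoids that.
df_sub = df.loc[df.index.intersection(ID_list)]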

Actually, what brought me to this page was that I needed to create a label in my `df` based on the index of another `df`: "if `df_1`'s index matches `df_2`'s index, label it 1, otherwise NaN", which I accomplished like this:

df_2['label'] = 1                # Create a label column in df_2
df_1 = df_1.join(df_2['label'])  # Left join: matching indices get 1, the rest NaN

This is also very fast.
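Put together, a minimal sketch of that labelling flow (the frames and values here are made up):

import pandas as pd

df_1 = pd.DataFrame({'Value1': [3.2, 2.2, 1.0]}, index=[1345, 1355, 2346])
df_2 = pd.DataFrame({'Value2': [32, 322]}, index=[1355, 3456])

df_2['label'] = 1                # mark every row of df_2
df_1 = df_1.join(df_2['label'])  # matching indices get 1, the rest NaN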
