Fastest way to check whether IDs in a dataframe exist in another dataframe

Question

I have a large pandas dataframe (around a million rows) and a list of IDs (about 100,000 entries). For each ID in the dataframe I have to check whether it is in my list (called special) and flag it accordingly:

df['Segment'] = df['ID'].apply(lambda x: 1 if x in special else np.nan)

The problem is that this is extremely slow: for each of the million IDs, the lambda checks membership in a list of 100,000 entries, which is a linear scan every time. Is there a faster way to accomplish this?

Answer 1

I recommend reading "When should I ever want to use apply".

Use Series.isin with Series.astype:

df['Segment'] = df['ID'].isin(special).astype(int)
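As a minimal, self-contained sketch of the isin approach: the toy dataframe and the values in special below are made up for illustration and stand in for the real million-row data.

```python
import pandas as pd

# Hypothetical toy data standing in for the real dataframe and ID list.
df = pd.DataFrame({"ID": [10, 11, 12, 13, 14]})
special = [11, 14, 99]

# Vectorised membership test: True where the ID appears in special.
mask = df["ID"].isin(special)

# Cast the boolean mask to 0/1 flags.
df["Segment"] = mask.astype(int)
print(df)
```

Note that this produces 0 for IDs that are not in special, whereas the apply version in the question produces NaN.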

We can also use Series.view (deprecated since pandas 2.2 in favour of Series.astype):

df['Segment'] = df['ID'].isin(special).view('uint8')

or numpy.where:

df['Segment'] = np.where(df['ID'].isin(special), 1, 0)
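If the NaN flag from the question should be kept for non-matching IDs, numpy.where can take np.nan as its third argument. The sketch below uses made-up toy data and also cross-checks against apply with a set, which is an assumption of mine rather than part of the answer: converting special to a set makes each lookup a hash lookup instead of a list scan.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the real dataframe and ID list.
df = pd.DataFrame({"ID": [10, 11, 12, 13, 14]})
special = [11, 14, 99]

# 1 where the ID is in special, NaN elsewhere (as in the question).
df["Segment"] = np.where(df["ID"].isin(special), 1, np.nan)

# Cross-check: apply with a set gives the same flags, since set
# membership is a hash lookup rather than a scan of the list.
special_set = set(special)
via_apply = df["ID"].apply(lambda x: 1 if x in special_set else np.nan)

print(df)
```

Because NaN is a float, the resulting Segment column has a floating-point dtype rather than an integer one.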

Answer 1 (score 2, accepted), 2020-01-24 10:52:10
