
Fastest way to check if an ID in your dataframe exists in another dataframe

I have a large pandas DataFrame (around a million rows) and a list of IDs (about 100,000 entries). For each ID in the DataFrame I have to check whether that ID is in my list (called special) and flag it accordingly:

df['Segment'] = df['ID'].apply(lambda x: 1 if x in special else np.nan)

The problem is that this is extremely slow: for each of the million IDs, the lambda expression checks whether that ID is in a list of 100,000 entries. Is there a faster way to accomplish this?

I recommend reading "When should I ever want to use apply".
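To illustrate where the time goes in the apply version, here is a minimal sketch. It assumes a tiny stand-in DataFrame and list (the values are made up for illustration). Each `x in special` against a list is a linear scan, while the same test against a set is a hash lookup. The set variant is not part of the original question or answer; it only shows that the list lookup, repeated once per row, is the costly part.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the real data (values are assumptions).
df = pd.DataFrame({"ID": [10, 20, 30, 40, 50]})
special = [20, 40, 99]

# Approach from the question: every `x in special` scans the list.
slow = df["ID"].apply(lambda x: 1 if x in special else np.nan)

# Same logic against a set: each membership test is a hash lookup.
special_set = set(special)
faster = df["ID"].apply(lambda x: 1 if x in special_set else np.nan)

# Both versions produce identical flags (NaNs in the same positions).
same = slow.equals(faster)
```

Even with a set, apply still calls a Python function once per row, so a vectorized method is usually the better choice.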

Use Series.isin with Series.astype:

 df['Segment'] = df['ID'].isin(special).astype(int)

We can also use Series.view (note that Series.view is deprecated in recent pandas versions, where astype is the suggested alternative):

df['Segment'] = df['ID'].isin(special).view('uint8')

or numpy.where:

df['Segment'] = np.where(df['ID'].isin(special), 1, 0)
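The variants above flag non-matching rows with 0, whereas the code in the question used NaN. The following is a small sketch on made-up data (the column names `Segment_int`, `Segment_where` and `Segment_nan` are hypothetical, chosen only to show the results side by side). It computes the boolean mask once and shows how np.where can also reproduce the 1/NaN flag from the question.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the real data (values are assumptions).
df = pd.DataFrame({"ID": [10, 20, 30, 40, 50]})
special = [20, 40, 99]

# Vectorized membership test: one boolean per row.
mask = df["ID"].isin(special)

# 1 for a match, 0 otherwise (as in the answer above).
df["Segment_int"] = mask.astype(int)
df["Segment_where"] = np.where(mask, 1, 0)

# 1 for a match, NaN otherwise (as in the question's original code).
df["Segment_nan"] = np.where(mask, 1, np.nan)
```

Because NaN is a float, the `Segment_nan` column has a floating-point dtype, while the 0/1 variants stay integer.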
