
Fastest way to check if an ID in your dataframe exists in another dataframe

I have a large pandas dataframe (around a million rows) and a list of IDs (100,000 entries). For each ID in df, I have to check whether it is in my list (called special) and flag it accordingly:

df['Segment'] = df['ID'].apply(lambda x: 1 if x in special else np.nan)

The problem is that this is extremely slow: for each of the million IDs, the lambda expression checks whether that ID is in a list of 100,000 entries. Is there a faster way to accomplish this?

I recommend reading "When should I ever want to use apply".

Use Series.isin with Series.astype:

df['Segment'] = df['ID'].isin(special).astype(int)
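As a minimal, self-contained sketch of the isin approach: the sample IDs below are made up for illustration, and only the names df, ID, Segment and special come from the question.

```python
import pandas as pd

# Hypothetical sample data standing in for the real frame and ID list.
df = pd.DataFrame({"ID": [10, 11, 12, 13, 14]})
special = [11, 14, 99]

# Vectorized membership test: one boolean per row, no Python-level loop.
mask = df["ID"].isin(special)

# Cast True/False to 1/0.
df["Segment"] = mask.astype(int)

print(df)
```

Note that this produces 0 rather than NaN for IDs that are not in special, which differs from the flagging in the question.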

We can also use Series.view (note that it is deprecated in recent pandas releases, where astype is preferred):

df['Segment'] = df['ID'].isin(special).view('uint8')

or numpy.where:

df['Segment'] = np.where(df['ID'].isin(special), 1, 0)
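If the original 1/NaN flagging from the question is needed rather than 1/0, numpy.where can take np.nan as the fallback value. A small sketch, again with made-up sample IDs (an assumption, not data from the question):

```python
import numpy as np
import pandas as pd

# Hypothetical sample data standing in for the real frame and ID list.
df = pd.DataFrame({"ID": [10, 11, 12, 13, 14]})
special = [11, 14, 99]

# 1 where the ID is in special, NaN elsewhere.
# The resulting column is float64, because NaN is a float value.
df["Segment"] = np.where(df["ID"].isin(special), 1, np.nan)

print(df)
```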
