
Speed up selection by a combination of two columns — Pandas

Currently, I am trying to select rows from a large dataframe (1.5 million rows), called active, by a combination of two columns from another dataframe, called passive, which has about 30,000 rows. If a combination of two columns in the active table matches the combination of two columns in the passive table, I select the row from the active table.

Here is the code:
active.loc[(active['userid']+active['orgcity']).isin(passive.userid+passive.city)]

However, this takes a long time, even though I expected it to be faster than iterating over the rows or using DataFrame.apply. Are there any other ways to speed this up?

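  1. Put each (userid, city) pair from passive, i.e. each row entry [passive.userid, passive.city], in a set. A membership check against the set is then an O(1) hash lookup on average instead of an O(n) scan of all 30,000 rows, so each check does up to about 30,000 times less work than a linear search.
  2. Then use the apply() function in pandas to run that lookup on every row of active. Note that apply() calls a Python function once per row rather than truly vectorizing the lookup.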
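
The two steps above can be sketched roughly as follows. This is a minimal illustration, not a benchmarked solution: the toy data, the value column, and the names passive_pairs, mask, and selected are assumptions made for the example. Only the userid, orgcity, and city column names come from the question.

```python
import pandas as pd

# Hypothetical toy data standing in for the real tables.
active = pd.DataFrame({
    "userid": ["u1", "u2", "u3", "u1"],
    "orgcity": ["Paris", "Rome", "Oslo", "Rome"],
    "value": [10, 20, 30, 40],  # illustrative extra column
})
passive = pd.DataFrame({
    "userid": ["u1", "u3", "u9"],
    "city": ["Paris", "Oslo", "Rome"],
})

# Step 1: collect every (userid, city) pair from passive in a set of tuples.
passive_pairs = set(zip(passive["userid"], passive["city"]))

# Step 2: test each row of active against the set with apply().
mask = active.apply(
    lambda row: (row["userid"], row["orgcity"]) in passive_pairs, axis=1
)
selected = active.loc[mask]
print(selected)
```

Using tuples rather than concatenated strings also avoids accidental matches such as "ab" + "c" and "a" + "bc". On 1.5 million rows, apply(axis=1) still builds a Series for every row, so replacing Step 2 with a list comprehension over zip(active["userid"], active["orgcity"]) performs the same set lookups with less per-row overhead.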

