
Fastest way to use isin in pandas

I have two CSV files with ID columns, where the IDs in the first file are a subset of the IDs in the second. To save space, after reading in the first file, I'm trying to read in only the rows of the second file whose IDs appear in the first, like so:

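import pandas as pd

# chunksize makes read_csv return an iterator of DataFrames (the value is an arbitrary choice)
chunker = pd.read_csv(t_path, chunksize=100000)

parts = []
for chunk in chunker:
    # keep_ids is a series of ids from previous table
    parts.append(chunk[chunk['Id'].isin(keep_ids)])
# concatenate once at the end instead of growing df inside the loop
df = pd.concat(parts, ignore_index=True)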

The files I'm dealing with are as large as 30 GB, so this can be slow. Is there a quicker way to find the matching IDs, possibly using indexes?

Edit 1: Would it be faster to set the index of each chunk to the Id column and then keep only the rows that successfully merge with keep_ids?
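
For reference, the merge idea could look roughly like the sketch below. The helper name filter_by_merge and the toy data are made up for illustration, and it assumes keep_ids may contain duplicates that should not multiply rows. Whether this beats isin on real data is something to time rather than assume.

```python
import pandas as pd

def filter_by_merge(chunk, keep_ids, id_col="Id"):
    """Keep rows of chunk whose id appears in keep_ids, using an inner merge."""
    # De-duplicate the keys so the merge cannot repeat rows of chunk.
    keys = pd.DataFrame({id_col: pd.Series(keep_ids).drop_duplicates().to_numpy()})
    return chunk.merge(keys, on=id_col, how="inner")

# Toy stand-ins for one chunk of the big file and the ids from the first file.
chunk = pd.DataFrame({"Id": [1, 2, 3, 4], "value": ["a", "b", "c", "d"]})
keep_ids = pd.Series([2, 4, 4])

merged = filter_by_merge(chunk, keep_ids)
by_isin = chunk[chunk["Id"].isin(keep_ids)].reset_index(drop=True)
```

On this toy input both approaches select the same rows, so the choice between them comes down to measured speed.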

Maybe something like this:

chunker = pd.read_csv(t_path, iterator=True, chunksize=1000)
df = pd.concat([chunk[chunk['Id'].isin(keep_ids)] for chunk in chunker], ignore_index=True)
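
Here is a self-contained sketch of the chunked filter that can be run without the real files. The function name read_filtered, the in-memory CSV, and the chunk sizes are assumptions for illustration only. For a 30 GB file a much larger chunksize than 1000 is probably worth trying, since each chunk carries some fixed overhead.

```python
import io
import pandas as pd

def read_filtered(source, keep_ids, id_col="Id", chunksize=100_000):
    """Read a CSV in chunks and keep only rows whose id is in keep_ids."""
    keep = set(keep_ids)  # drop duplicate ids once, before the loop
    chunks = pd.read_csv(source, chunksize=chunksize)
    # Filter each chunk, then concatenate once at the end.
    parts = [chunk[chunk[id_col].isin(keep)] for chunk in chunks]
    return pd.concat(parts, ignore_index=True)

# Small in-memory CSV standing in for the large file on disk.
csv_text = "Id,value\n1,a\n2,b\n3,c\n4,d\n5,e\n"
result = read_filtered(io.StringIO(csv_text), keep_ids=[2, 5], chunksize=2)
```

Only the filtered rows of each chunk are kept in memory, so peak usage depends on the chunk size and the number of matching rows, not on the size of the file.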
