Fastest way to use isin in pandas

Question

I have two CSV files with ID columns, where the IDs in the first file are a subset of the IDs in the second. To save space, after reading in the first CSV, I'm trying to read in only the rows of the second CSV whose IDs appear in the first, like so:

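import pandas as pd

# chunksize makes read_csv return an iterator of DataFrames (1000 is an assumed value)
chunker = pd.read_csv(t_path, chunksize=1000)

df = pd.DataFrame()
for chunk in chunker:
    # keep_ids is a series of ids from previous table
    temp = chunk[chunk['Id'].isin(keep_ids)]
    # DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent
    df = pd.concat([df, temp], ignore_index=True)
df = df.reset_index(drop=True)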

The files I'm dealing with are as large as 30 GB, so this can be slow. Is there a quicker way to find the matching IDs, possibly using indexes?

Edit 1: Would it be faster to set the index of each chunk to the ID column and then keep only the rows that successfully merge with keep_ids?
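To make the merge idea concrete, here is a minimal sketch of one way it could look for a single chunk. The helper name filter_chunk_by_merge and the sample data are hypothetical, and it merges on the Id column rather than on the index. Whether this is actually faster than isin would have to be measured on the real files.

```python
import pandas as pd

def filter_chunk_by_merge(chunk, keep_ids):
    """Keep only rows of `chunk` whose Id appears in `keep_ids` (hypothetical helper)."""
    # De-duplicate the ids first so the inner merge cannot multiply rows.
    keep_frame = pd.DataFrame({"Id": pd.unique(keep_ids)})
    # An inner merge drops every chunk row without a matching Id.
    return chunk.merge(keep_frame, on="Id", how="inner")

# Made-up sample data standing in for one chunk and the ids from the first table.
keep_ids = pd.Series([2, 5, 5])
chunk = pd.DataFrame({"Id": [1, 2, 3, 5], "value": ["a", "b", "c", "d"]})

matched = filter_chunk_by_merge(chunk, keep_ids)
```

Each filtered chunk would then be collected and combined once at the end, the same as with the isin version.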

Answer 1

Maybe something like this:

chunker = pd.read_csv(t_path, iterator=True, chunksize=1000)
df = pd.concat([chunk[chunk['Id'].isin(keep_ids)] for chunk in chunker])
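As a small self-contained check of this pattern, the sketch below runs the same chunked filter against an in-memory CSV. The sample data, the read_matching_rows helper, and the tiny chunk size are assumptions made for illustration; on a real file you would pass t_path and a much larger chunksize.

```python
import io
import pandas as pd

# Made-up stand-in for the large CSV; a real run would read from t_path.
csv_text = "Id,value\n1,a\n2,b\n3,c\n4,d\n5,e\n6,f\n"
keep_ids = pd.Series([2, 5, 6])  # ids taken from the first table

def read_matching_rows(source, keep_ids, chunksize=2):
    """Read `source` in chunks, keeping only rows whose Id is in `keep_ids`."""
    chunker = pd.read_csv(source, chunksize=chunksize)
    # Filter each chunk as it is read, so unmatched rows are discarded early.
    parts = [chunk[chunk["Id"].isin(keep_ids)] for chunk in chunker]
    # Concatenate once at the end instead of growing a DataFrame in the loop.
    return pd.concat(parts, ignore_index=True)

df = read_matching_rows(io.StringIO(csv_text), keep_ids)
```

Building the list of filtered chunks and calling pd.concat a single time avoids copying the accumulated result on every iteration, which is what repeated appending inside the loop does.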

solution1
1 2014-06-17 19:27:12
