
Joining two Excel sheets with Python using pandas

I'm trying to take the data in two different Excel workbooks, each with only one sheet, and join or merge them together. The first sheet has about 282,000 rows of data, and I'm merging a second sheet with about 13,000 rows into it via a common column. It's a one-to-many join. The code I have currently works, but it takes about 2.5 hours to run, and I feel like there should be a way to make it more efficient. Below is the code I have:

import pandas

df1 = pandas.read_excel('file1.xlsx')
df2 = pandas.read_excel('file2.xlsx')

final_file = pandas.merge(df1, df2, left_on='OWNER', right_on='ENTITY')
final_file.to_excel('file3.xlsx', index=False)

So how can I make this run faster? Should I be using something other than pandas?

EDIT: I think the part that takes so long is final_file.to_excel. Is there a different or better way to write the merged data? Maybe writing it to a new sheet in df1?

Owner  Prop    Decimal
AND15  1031    0.00264
AND15  1032    0.03461
AND16  1037    0.00046

Entity  Address    Fax
AND15   Fake 123   555-555-5555
AND16   Fake 456   555-555-5544

Owner  Prop    Decimal   Entity  Address    Fax
AND15  1031    0.00264   AND15   Fake 123   555-555-5555
AND15  1032    0.03461   AND15   Fake 123   555-555-5555
AND16  1037    0.00046   AND16   Fake 456   555-555-5544

And so on through the rest of the data. So it's matching Owner to Entity, and then adding the columns from df2 onto the end of the matched rows in df1.
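As a small illustration of that matching, here is a sketch that rebuilds the sample rows above and merges them. The column names follow the sample tables (Owner, Entity) rather than the upper-case names in the question's code, and the validate argument is an optional extra check that was not in the original code.

```python
import pandas as pd

# Sample rows copied from the tables above.
df1 = pd.DataFrame({
    "Owner": ["AND15", "AND15", "AND16"],
    "Prop": [1031, 1032, 1037],
    "Decimal": [0.00264, 0.03461, 0.00046],
})
df2 = pd.DataFrame({
    "Entity": ["AND15", "AND16"],
    "Address": ["Fake 123", "Fake 456"],
    "Fax": ["555-555-5555", "555-555-5544"],
})

# Default inner join: each df1 row is paired with the df2 row whose Entity
# equals its Owner. validate="many_to_one" makes pandas raise a MergeError
# if an Entity value appears more than once in df2.
merged = pd.merge(
    df1, df2, left_on="Owner", right_on="Entity", validate="many_to_one"
)
print(merged)
```

Each of the three df1 rows picks up the Entity, Address and Fax columns from its matching df2 row, which is the layout shown in the third table.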

EDIT 2: It seems that writing the result to .xlsx is the issue, and I guess I'm running out of RAM on the PC. Doing final_file.to_csv takes less than a minute. Lesson learned, I guess.
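For reference, the CSV route described in EDIT 2 might look like the sketch below. The helper name merge_and_write_csv and the output file name file3.csv are illustrative and not from the original post. OWNER and ENTITY are the column names used in the question's code.

```python
import pandas as pd


def merge_and_write_csv(df1, df2, out_path):
    """Join df1 to df2 where OWNER equals ENTITY and write the result as CSV."""
    final_file = pd.merge(df1, df2, left_on="OWNER", right_on="ENTITY")
    # to_csv streams plain text rows, which avoids building an .xlsx workbook.
    final_file.to_csv(out_path, index=False)
    return final_file


# Hypothetical usage with the workbooks from the question:
# df1 = pd.read_excel("file1.xlsx")
# df2 = pd.read_excel("file2.xlsx")
# merge_and_write_csv(df1, df2, "file3.csv")
```

Excel can open the resulting CSV directly, so this may be enough if the .xlsx format itself is not a hard requirement.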



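The code below should take less time to append and export.

1. Append df2 to df1, then export the result to CSV.

# DataFrame.append was removed in pandas 2.0; pandas.concat stacks the rows the same way.
Main_df = pandas.concat([df1, df2])
Main_df.to_csv('file3.csv', index=False)

Note: remove the header from whichever DataFrame you are going to append.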

It sounds as if importing the data is the bottleneck. I would try the threads below to speed up the imports:

Quick test of pandas merge speed using DataFrames of similar length:

import time
import pandas as pd
import numpy as np
# DataFrame.from_items was removed in pandas 1.0; a plain dict builds the same columns.
df1_test = pd.DataFrame({"Col1": np.arange(273882), "Col2": np.arange(273882), "Col3": np.arange(273882)})
df2_test = pd.DataFrame({"Col1": np.arange(13098), "Col2": np.arange(13098), "Col3": np.arange(13098)})

Time the merge of the DataFrames:

startTime = time.time()
df3_test = pd.merge(df1_test, df2_test, on='Col1')
print('The script took {0} seconds!'.format(time.time() - startTime))

The script took 0.0390000343323 seconds!

You could try this timing approach on the import, merge, and write sections of your code and optimise whichever section is slowest.
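A minimal sketch of timing each stage separately is shown here. The timed helper and the timings dictionary are illustrative names, and the synthetic DataFrames stand in for the real workbooks, so the absolute numbers will differ from the real data.

```python
import time
from contextlib import contextmanager

import numpy as np
import pandas as pd

timings = {}  # stage label -> elapsed seconds


@contextmanager
def timed(label):
    """Record and print how long the enclosed block takes."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[label] = time.perf_counter() - start
        print("{0}: {1:.3f} s".format(label, timings[label]))


# Synthetic stand-ins for pd.read_excel('file1.xlsx') and pd.read_excel('file2.xlsx').
with timed("build"):
    df1_test = pd.DataFrame({"Col1": np.arange(273882), "Col2": np.arange(273882)})
    df2_test = pd.DataFrame({"Col1": np.arange(13098), "Col3": np.arange(13098)})

with timed("merge"):
    df3_test = pd.merge(df1_test, df2_test, on="Col1")

# With no path argument, to_csv returns the CSV text instead of writing a file.
with timed("to_csv"):
    csv_text = df3_test.to_csv(index=False)
```

Wrapping the real read_excel, merge and to_excel or to_csv calls in the same way would show which of the three dominates the run time.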

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.
