I'm trying to take the data in two different excel workbooks, each with only one sheet, and join or merge them together. The first is a sheet with about 282,000 rows of data, and I'm merging a second sheet with about 13,000 rows of data to it via a common column. It's a one to many join. The code I have currently works, but it takes about 2.5 hours to run, and I feel like there should be a way to make it more efficient. Below is the code I have:
import pandas
df1 = pandas.read_excel('file1.xlsx')
df2 = pandas.read_excel('file2.xlsx')
final_file = pandas.merge(df1, df2, left_on='OWNER', right_on='ENTITY')
final_file.to_excel('file3.xlsx', index=False)
So how can I make this run faster? Should I be using something other than pandas?
EDIT: So what takes so long is the final_file.to_excel I think. Is there a different/better way to write the merged data? Maybe writing it to a new sheet in df1?
df1
Owner Prop Decimal
AND15 1031 0.00264
AND15 1032 0.03461
AND16 1037 0.00046
df2
Entity Address Fax
AND15 Fake 123 555-555-5555
AND16 Fake 456 555-555-5544
final_file
Owner Prop Decimal Entity Address Fax
AND15 1031 0.00264 AND15 Fake 123 555-555-5555
AND15 1032 0.03461 AND15 Fake 123 555-555-5555
AND16 1037 0.00046 AND16 Fake 456 555-555-5544
Etc on the data. So it's matching Owner and Entity, and then adding the columns from df2 onto the end of matched rows in df1.
EDIT 2: It seems that trying to write the result to .xlsx is the issue, and I guess I'm running out of RAM on the pc. Doing final_file.to_csv takes it less than a minute. Lesson learned I guess.
Thanks,
Ethan
Below code will take lesser time to append and export.
1.Append the df1 with df2 and then export it into csv.
Main_df = df1.append(df2)
Note :- Remove header of that specific df which ever you going to append.
It sounds as if the importing of data is the bottleneck. I would try the below threads to speed up the imports:
Quick Test of Pandas Merge Speed using similar Len dimensions:
import time
import pandas as pd
import numpy as np
df1_test = pd.DataFrame.from_items(zip(["Col1","Col2","Col3"], [np.arange(273882),np.arange(273882),np.arange(273882)]))
df2_test = pd.DataFrame.from_items(zip(["Col1","Col2","Col3"], [np.arange(13098),np.arange(13098),np.arange(13098)]))
Time merge of dataframes
startTime = time.time(); df3_test = pd.merge(df1_test, df2_test, left_on='Col1', right_on='Col1'); print ('The script took {0} second !'.format(time.time() - startTime))
The script took 0.0390000343323 second !
You could Try this across your import sections, merge sections and write sections of your code and optimise this section accordingly.
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.