
Is there a faster way to move millions of rows from Excel to a SQL database using Python?

I am a financial analyst with about two months' experience with Python, and I am working on a project using Python and SQL to automate the compilation of a report. The process involves accessing a changing number of Excel files saved in a shared drive, pulling two tabs from each (summary and quote), and combining the datasets into two large "Quote" and "Summary" tables. The next step is to pull various columns from each, combine, calculate, etc.

The problem is that the dataset ends up being about 3.4 million rows and around 30 columns. The program I wrote below works, but it took 40 minutes to work through the first part (creating the list of dataframes) and another 4.5 hours to create the database and export the data, not to mention using a LOT of memory.

I know there must be a better way to accomplish this, but I don't have a CS background. Any help would be appreciated.

import os
import pandas as pd
from datetime import datetime
import sqlite3
from sqlalchemy import create_engine
from playsound import playsound

reportmonth = '2020-08'
month_folder = r'C:\syncedSharePointFolder'
os.chdir(month_folder)

starttime = datetime.now()
print('Started', starttime)

c = 0

tables = list()
quote_combined = list()
summary_combined = list()

# Step through files in synced Sharepoint directory, select the files with the specific
# name format. For each file, parse the file name and add to 'tables' list, then load
# two specific tabs as pandas dataframes.  Add two columns, format column headers, then 
# add each dataframe to the list of dataframes. 

for xl in os.listdir(month_folder):
    if '-Amazon' in xl:
        ttime = datetime.now()
        table_name = str(xl[11:-5])
        tables.append(table_name)
        quote_sheet = pd.read_excel(xl, sheet_name='-Amazon-Quote')
        summary_sheet = pd.read_excel(xl, sheet_name='-Amazon-Summary')
        
        quote_sheet.insert(0,'reportmonth', reportmonth)
        summary_sheet.insert(0,'reportmonth', reportmonth)
        quote_sheet.insert(0,'source_file', table_name)
        summary_sheet.insert(0,'source_file', table_name)
        quote_sheet.columns = quote_sheet.columns.str.strip()
        quote_sheet.columns = quote_sheet.columns.str.replace(' ', '_')
        summary_sheet.columns = summary_sheet.columns.str.strip()
        summary_sheet.columns = summary_sheet.columns.str.replace(' ', '_')
        
        quote_combined.append(quote_sheet)
        summary_combined.append(summary_sheet)
        
        c = c + 1
        
        print('Step', c, 'complete: ', datetime.now() - ttime, datetime.now() - starttime)

# Concatenate the list of dataframes to append one to another.  
# Totals about 3.4mm rows for August

totalQuotes = pd.concat(quote_combined)
totalSummary = pd.concat(summary_combined)     

# Change directory, create Sqlite database, and send the combined dataframes to database

os.chdir(r'H:\AaronS\Databases')
conn = sqlite3.connect('AMZN-Quote-files_' + reportmonth)
cur = conn.cursor()
engine = create_engine('sqlite:///AMZN-Quote-files_' + reportmonth + '.sqlite', echo=False)
sqlite_connection = engine.connect()

sqlite_table = 'totalQuotes'
sqlite_table2 = 'totalSummary'

totalQuotes.to_sql(sqlite_table, sqlite_connection, if_exists = 'replace')    
totalSummary.to_sql(sqlite_table2, sqlite_connection, if_exists = 'replace')  
     
print('Finished. It took: ', datetime.now() - starttime)

Try this. Most of the time is spent loading the data from Excel into the DataFrames. I am not sure the following script will bring the time down to seconds, but it will reduce the RAM footprint, which in turn could speed up the process, potentially by at least 5-10 minutes. Since I have no access to the data I cannot be sure, but you should try this:

import os
import pandas as pd
from datetime import datetime
from sqlalchemy import create_engine

reportmonth = '2020-08'
month_folder = r'C:\syncedSharePointFolder'

# Create the SQLite database and open the connection up front, so each
# Excel file can be appended to the two tables as soon as it is read.
os.chdir(r'H:\AaronS\Databases')
engine = create_engine('sqlite:///AMZN-Quote-files_' + reportmonth + '.sqlite', echo=False)
sqlite_connection = engine.connect()

sqlite_table = 'totalQuotes'
sqlite_table2 = 'totalSummary'

os.chdir(month_folder)

starttime = datetime.now()
print('Started', starttime)


c = 0
tables = list()

for xl in os.listdir(month_folder):
    if '-Amazon' in xl:
        ttime = datetime.now()
        
        table_name = str(xl[11:-5])
        tables.append(table_name)
        
        quote_sheet = pd.read_excel(xl, sheet_name='-Amazon-Quote')
        summary_sheet = pd.read_excel(xl, sheet_name='-Amazon-Summary')
        
        quote_sheet.insert(0,'reportmonth', reportmonth)
        summary_sheet.insert(0,'reportmonth', reportmonth)
        
        quote_sheet.insert(0,'source_file', table_name)
        summary_sheet.insert(0,'source_file', table_name)
        
        quote_sheet.columns = quote_sheet.columns.str.strip()
        quote_sheet.columns = quote_sheet.columns.str.replace(' ', '_')
        
        summary_sheet.columns = summary_sheet.columns.str.strip()
        summary_sheet.columns = summary_sheet.columns.str.replace(' ', '_')
        
        quote_sheet.to_sql(sqlite_table, sqlite_connection, if_exists = 'append')    
        summary_sheet.to_sql(sqlite_table2, sqlite_connection, if_exists = 'append')  
        
        c = c + 1
        print('Step', c, 'complete: ', datetime.now() - ttime, datetime.now() - starttime)

I see a few things you could do. Firstly, since your first step is just to transfer the data to your SQL DB, you don't necessarily need to append all the files to each other. You can attack the problem one file at a time (which means you can multiprocess), and whatever computations need to be done can come later. This will also cut down your RAM usage: if you have 10 files in your folder, you aren't loading all 10 at the same time.
I would recommend the following:

  1. Construct an array of the filenames that you need to access.
  2. Write a wrapper function that can take a filename, open and parse the file, and write the contents to your SQL DB (a fuller sketch follows the snippet below).
  3. Use the Python multiprocessing.Pool class to process the files simultaneously. If you run 4 processes, for example, your task becomes roughly 4 times faster. If you need to derive computations from this data and therefore need to aggregate it, do that once the data is in the SQL DB; it will be much faster.
  4. If you need to define some computations based on the aggregate data, do it now, in the SQL DB. SQL is an incredibly powerful language, and there's a command out there for practically everything!

I've added a short code snippet to show you what I'm talking about :)

from multiprocessing import Pool

PROCESSES = 4

FILES = []  # list of Excel file paths to process

def _process_file(filename):
    # Placeholder: open and parse the file here, then write it to the DB
    print("Processing: " + filename)

# The __main__ guard matters on Windows, where worker processes re-import this module
if __name__ == '__main__':
    pool = Pool(PROCESSES)
    pool.map(_process_file, FILES)
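
To make step 2 concrete, here is a rough sketch of what that wrapper function might look like for your files. The sheet names, file-name slicing, column clean-up, and database path are taken from your question, but opening a separate engine in each worker is my assumption, not something from the original answer. Note that several processes appending to one SQLite file can hit "database is locked" errors, so you may need to lower the process count or have the workers return DataFrames and let the parent do the writes.

import os
from multiprocessing import Pool

import pandas as pd
from sqlalchemy import create_engine

REPORTMONTH = '2020-08'
MONTH_FOLDER = r'C:\syncedSharePointFolder'
# Forward slashes keep the SQLite URL simple on Windows
DB_PATH = 'H:/AaronS/Databases/AMZN-Quote-files_' + REPORTMONTH + '.sqlite'

def process_file(path):
    # Each worker opens its own engine; engines should not be shared across processes.
    engine = create_engine('sqlite:///' + DB_PATH, echo=False)

    table_name = os.path.basename(path)[11:-5]
    quote = pd.read_excel(path, sheet_name='-Amazon-Quote')
    summary = pd.read_excel(path, sheet_name='-Amazon-Summary')

    for df in (quote, summary):
        df.insert(0, 'reportmonth', REPORTMONTH)
        df.insert(0, 'source_file', table_name)
        df.columns = df.columns.str.strip().str.replace(' ', '_')

    # Append straight into the combined tables, one file at a time.
    with engine.begin() as conn:
        quote.to_sql('totalQuotes', conn, if_exists='append')
        summary.to_sql('totalSummary', conn, if_exists='append')

if __name__ == '__main__':
    files = [os.path.join(MONTH_FOLDER, f)
             for f in os.listdir(MONTH_FOLDER) if '-Amazon' in f]
    with Pool(4) as pool:
        pool.map(process_file, files)

If the concurrent writes do collide, a simple variant is to have process_file only read and clean the two sheets and return them, then let the parent process call to_sql sequentially on whatever pool.map returns.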

SQL clarification: You don't need an independent table for every file you move to SQL! You can create one table based on a given schema, and then add the data from ALL your files to that one table, row by row. This is essentially what the function you use to go from DataFrame to table does, but the way you're calling it creates 10 different tables. You can look up examples of inserting a row into a table with plain SQL to see how this works under the hood.
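
For illustration, here is a minimal sketch of that pattern with the standard sqlite3 module; the table schema and column names are placeholders I made up, not your real quote columns:

import sqlite3

conn = sqlite3.connect('AMZN-Quote-files_2020-08.sqlite')
cur = conn.cursor()

# One table with a fixed schema; every file's rows go into this same table.
cur.execute("""
    CREATE TABLE IF NOT EXISTS totalQuotes (
        source_file TEXT,
        reportmonth TEXT,
        quote_value REAL
    )
""")

# Insert many rows at once; in practice these would come from one Excel file.
rows = [
    ('2020-08-01-Amazon-A', '2020-08', 123.45),
    ('2020-08-01-Amazon-B', '2020-08', 678.90),
]
cur.executemany(
    "INSERT INTO totalQuotes (source_file, reportmonth, quote_value) VALUES (?, ?, ?)",
    rows,
)
conn.commit()
conn.close()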

However, in the specific use case that you have, setting the if_exists parameter to "append" should work, as you've mentioned in your comment. I just added the earlier references in because you mentioned that you're fairly new to Python, and a lot of my friends in the finance industry have found gaining a slightly more nuanced understanding of SQL to be extremely useful.
