
What is the fastest way to save a pandas DataFrame to a MySQL database?

I am writing Python code to generate and update a MySQL table based on a table in another MySQL database.

My code does something like this:

For each date in a date range:

  1. Query a quantity from db1 between two dates

  2. Do some work in pandas => df

  3. Delete from db2 the rows whose ids are in df

  4. Save df with df.to_sql
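
For reference, steps 1 and 2 look roughly like this (a minimal sketch; the connection string and the table/column names below are placeholders, not my real ones):

import pandas as pd
from sqlalchemy import create_engine, text

# Placeholder connection string; user, password, host and schema are assumptions.
db1 = create_engine("mysql+pymysql://user:password@host/db1")

def load_between(start, end):
    # Step 1: query a quantity between two dates
    # (source_table, quantity and dt are hypothetical names).
    query = text("""
        SELECT id, quantity, dt
        FROM source_table
        WHERE dt BETWEEN :start AND :end
    """)
    with db1.connect() as con:
        df = pd.read_sql(query, con, params={"start": start, "end": end})
    # Step 2: pandas transformations on df go here.
    return df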

Steps 1-2 take less than 2 s, while steps 3-4 can take up to 10 s; step 4 takes about four times as long as step 3. How can I improve my code to make the writing process more efficient?

I have already chunked the df for steps 3 and 4, and I have added method='multi' to .to_sql (this did not work at all). I was wondering if we could do better:

# chunks() and to_tuple() are my own helpers: chunks() splits a list into
# fixed-size pieces, to_tuple() formats one piece as a SQL tuple literal.
with db.begin() as con:
    # Step 3: delete the existing rows in batches of 1,000 ids.
    for chunked in chunks(df.id.tolist(), 1000):
        con.execute("""DELETE FROM table
                       WHERE id IN {}""".format(to_tuple(chunked)))
    # Step 4: append the new rows in batches of 100,000 ids.
    for chunked in chunks(df.id.tolist(), 100000):
        df.query("id in @chunked").to_sql('table', con, index=False,
                                          if_exists='append')

Thanks for your help.

I have found df.to_sql to be very slow. One way I've gotten around this issue is by writing the dataframe to a CSV file with df.to_csv, using BCP (SQL Server's bulk-copy tool) to bulk-insert the data from the CSV into the table, and then deleting the CSV file once the insertion is done. You can use subprocess to run BCP from a Python script.
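
Since the question is about MySQL, where BCP is not available, the analogous technique is LOAD DATA LOCAL INFILE. Here is a minimal sketch of the same CSV-then-bulk-load idea, assuming local_infile is enabled on both client and server; the engine URL is a placeholder:

import os
import tempfile

from sqlalchemy import create_engine, text

# Placeholder URL; the driver must allow LOCAL INFILE (pymysql: local_infile=True).
db2 = create_engine(
    "mysql+pymysql://user:password@host/db2",
    connect_args={"local_infile": True},
)

def bulk_append(df, table):
    # Write the frame to a temporary CSV, bulk-load it in one statement,
    # then remove the file.
    fd, path = tempfile.mkstemp(suffix=".csv")
    os.close(fd)
    try:
        df.to_csv(path, index=False, header=False)
        sql = (
            "LOAD DATA LOCAL INFILE '{path}' INTO TABLE {table} "
            "FIELDS TERMINATED BY ',' LINES TERMINATED BY '\\n'"
        ).format(path=path.replace(os.sep, "/"), table=table)
        with db2.begin() as con:
            con.execute(text(sql))
    finally:
        os.remove(path)

The win in both cases comes from letting the server parse the file in one pass instead of executing one INSERT per row (or per chunk).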
