
Speeding up Pandas to_sql()?

I have a 1,000,000 x 50 Pandas DataFrame that I am currently writing to a SQL table using:

df.to_sql('my_table', con, index=False)

It takes an incredibly long time. I've seen various explanations about how to speed up this process online, but none of them seem to work for MSSQL.

  1. If I try the method in:

    Bulk Insert A Pandas DataFrame Using SQLAlchemy

    then I get a "no attribute 'copy_from'" error.

  2. If I try the multithreading method from:

    http://techyoubaji.blogspot.com/2015/10/speed-up-pandas-tosql-with.html

    then I get a "QueuePool limit of size 5 overflow 10 reached, connection timed out" error.

Is there any easy way to speed up to_sql() to an MSSQL table? Either via BULK COPY or some other method, but entirely from within Python code?

I've used ctds to do a bulk insert, and it's a lot faster with SQL Server. In the example below, df is the pandas DataFrame. The column order in the DataFrame must match the schema of the target table in mydb.

import ctds

# Connect to SQL Server with the ctds (FreeTDS-based) driver.
conn = ctds.connect('server', user='user', password='password', database='mydb')

# Convert the DataFrame rows to a list of tuples and bulk insert them.
conn.bulk_insert('table', df.to_records(index=False).tolist())
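
If the DataFrame columns are not already in the table's order, they can be reordered first. A minimal sketch, using hypothetical column names that stand in for the target table's actual columns:

# Hypothetical column order; replace with the target table's actual column order.
table_columns = ['col_a', 'col_b', 'col_c']

# Reorder the DataFrame so each record lines up with the table schema, then bulk insert.
rows = df[table_columns].to_records(index=False).tolist()
conn.bulk_insert('table', rows)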

I had the same issue, so I used SQLAlchemy with fast_executemany.

from sqlalchemy import event, create_engine

engine = create_engine('connection_string_with_database')

@event.listens_for(engine, 'before_cursor_execute')
def plugin_bef_cursor_execute(conn, cursor, statement, params, context, executemany):
    if executemany:
        cursor.fast_executemany = True  # switch pyodbc from executemany to fast_executemany

Make sure the listener function is defined after the engine is created and registered before to_sql() is called.

df.to_sql('table', con=engine, if_exists='append', index=False)  # pass the engine itself; see the pandas to_sql documentation for reference
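
As a side note, newer SQLAlchemy releases (1.3+) accept the same pyodbc flag directly on create_engine for the mssql+pyodbc dialect, which avoids the event listener entirely. A minimal sketch with a placeholder connection string (user, server, database and driver names are assumptions):

from sqlalchemy import create_engine

# Placeholder mssql+pyodbc URL; adjust user, password, server, database and driver to your setup.
engine = create_engine(
    'mssql+pyodbc://user:password@server/mydb?driver=ODBC+Driver+17+for+SQL+Server',
    fast_executemany=True,  # SQLAlchemy 1.3+ passes this through to pyodbc
)

df.to_sql('table', con=engine, if_exists='append', index=False)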

In pandas 0.24+ you can use method='multi' together with a chunksize of 1000, which is SQL Server's limit on the number of rows in a single INSERT ... VALUES statement (see the sketch after the quoted documentation below):

chunksize=1000, method='multi'

https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#io-sql-method

New in version 0.24.0.

The parameter method controls the SQL insertion clause used. Possible values are:

None: Uses standard SQL INSERT clause (one per row).
'multi': Pass multiple values in a single INSERT clause. It uses a special SQL syntax not supported by all backends. This usually provides better performance for analytic databases like Presto and Redshift, but has worse performance for traditional SQL backends if the table contains many columns. For more information check the SQLAlchemy documentation.
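
Applied to the DataFrame from the question, the call is a one-liner. A minimal sketch; note that SQL Server also limits a single statement to 2100 parameters, so with the question's 50 columns the effective chunksize for method='multi' has to stay at or below 2100 // 50 = 42 rows:

# Multi-row INSERTs (pandas 0.24+); keep chunksize * number_of_columns <= 2100 for SQL Server.
df.to_sql('my_table', con, index=False, chunksize=40, method='multi')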
