I'm currently trying to tune the performance of a few of my scripts, and the bottleneck is always the actual insert into the DB (MSSQL) with the pandas to_sql function.
One factor which plays into this is MSSQL's limit of 2100 parameters per statement.
I establish my connection with SQLAlchemy (using the mssql+pyodbc dialect):
import sqlalchemy
# params is the URL-encoded ODBC connection string (e.g. built with urllib.parse.quote_plus)
engine = sqlalchemy.create_engine("mssql+pyodbc:///?odbc_connect=%s" % params, fast_executemany=True)
When inserting I use method="multi" together with a chunksize, so that I stay below the parameter limit:
dataframe_audit.to_sql(name="Audit", con=connection, if_exists="append",
                       method="multi", chunksize=50, index=False)
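For context, with method="multi" each chunk sends rows_per_chunk × n_columns parameters, so the chunk size has to be chosen against the 2100-parameter limit. A minimal sketch of that arithmetic (safe_chunksize is a hypothetical helper, not a pandas API):

```python
# MSSQL allows at most 2100 parameters per batch. With method="multi",
# each chunk sends rows_per_chunk * n_columns parameters, so the chunk
# size must keep that product strictly below the limit.
PARAM_LIMIT = 2100

def safe_chunksize(n_columns: int, param_limit: int = PARAM_LIMIT) -> int:
    """Largest rows-per-chunk that keeps rows * columns strictly below the limit."""
    return max(1, (param_limit - 1) // n_columns)

# e.g. a 40-column DataFrame could use chunks of up to 52 rows:
# 52 * 40 = 2080 parameters < 2100
```

This also shows why chunksize=50 is conservative for narrow tables: a 10-column DataFrame could safely use chunks of around 200 rows.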
This leads to extremely inconsistent performance: insert times vary widely from run to run, and I'm not sure what to make of that.
Any ideas to get better insert performance for my DataFrames?
If you are using the most recent version of pyodbc with ODBC Driver 17 for SQL Server and fast_executemany=True in your SQLAlchemy create_engine call, then you should be using method=None (the default) in your to_sql call. That will allow pyodbc to use an ODBC parameter array and give you the best performance under that setup. You will not hit the SQL Server stored procedure limit of 2100 parameters (unless your DataFrame has ~2100 columns). The only limit you would face would be if your Python process does not have sufficient memory available to build the entire parameter array before sending it to the SQL Server.
The method='multi' option for to_sql is only applicable to pyodbc when using an ODBC driver that does not support parameter arrays (e.g., FreeTDS ODBC). In that case, fast_executemany=True will not help and may actually cause errors.