简体   繁体   中英

How to speed up pandas to_sql

I am trying to upload data to a MS Azure Sql database using pandas to_sql and it takes very long. I often have to run it before I go to bed and wake up in the morning and it is done but has taken several hours and if there is an error that comes up I am not able to address it. Here is the code I have:

params = urllib.parse.quote_plus(
'Driver=%s;' % driver +
'Server=%s,1433;' % server +
'Database=%s;' % database +
'Uid=%s;' % username +
'Pwd={%s};' % password +
'Encrypt=yes;' +
'TrustServerCertificate=no;'
)

conn_str = 'mssql+pyodbc:///?odbc_connect=' + params
engine = create_engine(conn_str)

@event.listens_for(engine, 'before_cursor_execute')
def receive_before_cursor_execute(conn, cursor, statement, params, context, executemany):
    if executemany:
        cursor.fast_executemany = True
        cursor.commit()
        
connection = engine.connect()
connection

Then I run this command for the sql ingestion:

master_data.to_sql('table_name', engine, chunksize=500, if_exists='append', method='multi',index=False)

I have played around with the chunksize and the sweet spot seems to be 100, which isn't fast enough considering I am usually trying to upload 800,000-2,000,000 records at a time. If I increase it beyond that I will get an error which seems to only be related to the chunk size.

OperationalError: (pyodbc.OperationalError) ('08S01', '[08S01] [Microsoft][ODBC Driver 17 for SQL Server]Communication link failure (0) (SQLExecDirectW)')

Not sure if you have your issue resolved but did want to provide an answer here for the benefit of providing Azure SQL Database libraries for Python specific information and some useful resources to investigate and resolve this issue, as applicable.

An example of using pyodbc to directly query an Azure SQL Database: Quickstart: Use Python to query Azure SQL Database Single Instance & Managed Instance

An example of using Pandas dataframe: How to read and write to an Azure SQL database from a Pandas dataframe

main.py

"""Read write to Azure SQL database from pandas"""
import pyodbc
import pandas as pd
import numpy as np
from sqlalchemy import create_engine

# 1. Constants
AZUREUID = 'myuserid'                                    # Azure SQL database userid
AZUREPWD = '************'                                # Azure SQL database password
AZURESRV = 'shareddatabaseserver.database.windows.net'   # Azure SQL database server name (fully qualified)
AZUREDB = 'Pandas'                                      # Azure SQL database name (if it does not exit, pandas will create it)
TABLE = 'DataTable'                                      # Azure SQL database table name
DRIVER = 'ODBC Driver 13 for SQL Server'                 # ODBC Driver

def main():
"""Main function"""

# 2. Build a connectionstring
connectionstring = 'mssql+pyodbc://{uid}:{password}@{server}:1433/{database}?driver={driver}'.format(
    uid=AZUREUID,
    password=AZUREPWD,
    server=AZURESRV,
    database=AZUREDB,
    driver=DRIVER.replace(' ', '+'))

# 3. Read dummydata into dataframe 
df = pd.read_csv('./data/data.csv')

# 4. Create SQL Alchemy engine and write data to SQL
engn = create_engine(connectionstring)
df.to_sql(TABLE, engn, if_exists='append')

# 5. Read data from SQL into dataframe
query = 'SELECT * FROM {table}'.format(table=TABLE)
dfsql = pd.read_sql(query, engn)

print(dfsql.head())


if __name__ == "__main__":
    main()

And finally, the following resources should assist in comparing specific implementations, with performance issues, with the below information where the Stack Overflow thread is likely the best resource but the Monitoring and Performance tuning document is useful to investigate and mitigate ay server-side performance issues, etc.

Speeding up pandas.DataFrame.to_sql with fast_executemany of pyODBC Monitoring and performance tuning in Azure SQL Database and Azure SQL Managed Instance

Regards, Mike

params = urllib.parse.quote_plus(
    'Driver=%s;' % driver +
    'Server=%s,1433;' % server +
    'Database=%s;' % database +
    'Uid=%s;' % username +
    'Pwd={%s};' % password +
    'Encrypt=yes;' +
    'TrustServerCertificate=no;'
    )


conn_str = 'mssql+pyodbc:///?odbc_connect=' + params
engine = create_engine(conn_str)

@event.listens_for(engine, 'before_cursor_execute')
def receive_before_cursor_execute(conn, cursor, statement, params, context, executemany):
    if executemany:
        cursor.fast_executemany = True
        cursor.commit()
        
connection = engine.connect()
connection

Database ingestion is done with this next line. I had issues with chunksize before but fixed it by adding the method and index.

ingest_data.to_sql('db_table_name', engine, if_exists='append',chunksize=100000, method=None,index=False)

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM