Is there a faster way to insert a dataframe into SQL using Python?

We have two parts to get the final data frame into SQL:

  1. Downloading the datasets from Azure and transforming them using Python.
  2. Uploading the transformed data to Azure and then inserting the final dataframe into SQL.

Downloading, transforming, and uploading take 5 minutes, but the insertion into SQL takes quite a long time. I used the code below for faster insertion.

import urllib.parse

import sqlalchemy as sa
from sqlalchemy.exc import SQLAlchemyError

server = 'XXXX.database.windows.net'
database = 'XXX'
username = 'XXX'
password = 'XXXX'
driver = '{ODBC Driver 17 for SQL Server}'

# Build the ODBC connection string and URL-encode it for the SQLAlchemy URL.
params = urllib.parse.quote_plus('DRIVER=' + driver +
                                 ';SERVER=' + server +
                                 ';PORT=1433;DATABASE=' + database +
                                 ';UID=' + username +
                                 ';PWD=' + password)

# fast_executemany=True lets pyodbc send each chunk of parameters
# to SQL Server in a single round trip instead of row by row.
engine = sa.create_engine("mssql+pyodbc:///?odbc_connect={}".format(params),
                          fast_executemany=True)

with engine.connect() as connection:
    try:
        df_copy.to_sql('XXXX', connection,
                       if_exists='append', index=False, chunksize=500)
    except SQLAlchemyError as e:
        error = str(e.__dict__['orig'])
        print(error)

The final data frame contains 97,000 rows and 127 columns.

SQL Server configuration: Azure SQL with 10 DTUs and 250 GB of storage.

The error is:

Exception has occurred: OperationalError (pyodbc.OperationalError) ('08S01', '[08S01] [Microsoft][ODBC Driver 17 for SQL Server]TCP Provider: An existing connection was forcibly closed by the remote host.\r\n (10054) (SQLExecute); [08S01] [Microsoft][ODBC Driver 17 for SQL Server]Communication link failure (10054)')

I have also used connect_args={'connect_timeout': 2400} inside create_engine, but after 40-50 minutes we receive the same error message. I think 50 minutes for 97k records is quite a long time. Is there any way I could improve the process? Also, I'm currently running on my local machine, which has 16 GB of RAM and a 12th Gen Intel(R) Core(TM) i7-1265U 1.80 GHz processor. We use Jenkins for deployment; would performance be any faster if we ran the job on Jenkins?
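
One way to keep the job from failing wholesale on these 10054 disconnects is to commit in smaller, independent batches and retry a failed batch on a fresh connection. Below is a minimal sketch of that pattern, reusing the engine and df_copy from the code above; insert_in_batches, the 10,000-row batch size, and the retry count are illustrative assumptions, not part of the original post.

import time
from sqlalchemy.exc import OperationalError

def insert_in_batches(df, table, engine, rows_per_batch=10_000, retries=3):
    # Commit each batch separately so a dropped connection only
    # costs the batch in flight, not the whole multi-hour insert.
    for start in range(0, len(df), rows_per_batch):
        batch = df.iloc[start:start + rows_per_batch]
        for attempt in range(retries):
            try:
                # engine.begin() checks out a fresh connection and
                # commits (or rolls back) when the block exits.
                with engine.begin() as conn:
                    batch.to_sql(table, conn, if_exists='append', index=False)
                break
            except OperationalError:
                if attempt == retries - 1:
                    raise
                engine.dispose()  # discard possibly-dead pooled connections
                time.sleep(5)     # brief pause before retrying this batch

insert_in_batches(df_copy, 'XXXX', engine)

One caveat: if the link drops between the server committing a batch and the client seeing the result, a retry can insert that batch twice, so this pattern suits staging tables better than targets with strict uniqueness requirements.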

Hello there, you should try to specify the chunksize in your call: df_copy.to_sql('XXXX', engine.connect(), index=False, if_exists='append', method=None, chunksize=50000)
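
Spelled out against the engine defined in the question, the suggestion looks roughly like the sketch below ('XXXX' is the placeholder table name from the post). One thing to watch: fast_executemany buffers each chunk's parameters in client memory, so with 127 columns a 50,000-row chunk is fairly heavy, and a more modest value may be a safer starting point.

# A sketch of the answer's suggestion, not a verified benchmark.
with engine.begin() as connection:    # begin() commits on success
    df_copy.to_sql('XXXX', connection,
                   index=False, if_exists='append',
                   method=None,       # plain executemany; fast_executemany batches it
                   chunksize=50000)   # rows per round trip, per the answer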
