
Speed up insert to SQL Server from CSV file without using BULK INSERT or pandas to_sql

I want to put a Pandas dataframe as a whole into a table in a MS SQL Server database. BULK INSERT is not allowed for common users like myself. I am using pyodbc to connect to my database, and I am using Pandas 0.13.1. I read that the to_sql method was only introduced in version 0.14, so it is not available for my pandas dataframe. Therefore I used an iterator. My dataframe has 2 columns: Col1 and Col2.

My code is working and looks like this:

from pyodbc import connect
import pandas as pd

df = pd.read_csv('PathToMyCSVfile', sep=';', header=0)

cnxn = connect('DRIVER={SQL Server};SERVER=MyServer;DATABASE=MyDatabase')
cursor = cnxn.cursor()

for index, row in df.iterrows():
    cursor.execute("INSERT INTO MySchema.MyTable VALUES (?,?)", df['Col1'][index], df['Col2'][index])
    cnxn.commit()

As said, the above code works, but it is slow... What can I do to speed things up?

The bottleneck you face is that your code sends an INSERT statement for each row in the DataFrame. That is, for a sample data file

id;txt
1;alpha
2;bravo
3;charlie
4;delta
5;echo
6;foxtrot
7;golf

you would need seven (7) round-trips to the server to send the equivalent of

INSERT INTO MySchema.MyTable VALUES (1,'alpha')
INSERT INTO MySchema.MyTable VALUES (2,'bravo')
INSERT INTO MySchema.MyTable VALUES (3,'charlie')
...
INSERT INTO MySchema.MyTable VALUES (7,'golf')

You could speed that up significantly by using a Table Value Constructor to do the same thing in one round-trip:

INSERT INTO MySchema.MyTable VALUES (1,'alpha'),(2,'bravo'),(3,'charlie'), ... ,(7,'golf')
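As a rough sketch of that idea in Python (assuming the MySchema.MyTable layout from the question and only a handful of rows; the full, batched solution follows below), the statement and its flattened parameter list can be built like this:

import pyodbc

# minimal illustration only: build one parameterized multi-row INSERT
# connection details are placeholders from the question
cnxn = pyodbc.connect('DRIVER={SQL Server};SERVER=MyServer;DATABASE=MyDatabase')
crsr = cnxn.cursor()

rows = [(1, 'alpha'), (2, 'bravo'), (3, 'charlie')]
placeholders = ','.join('(?,?)' for _ in rows)      # '(?,?),(?,?),(?,?)'
params = [value for row in rows for value in row]   # flatten rows into one parameter list
crsr.execute(f"INSERT INTO MySchema.MyTable VALUES {placeholders}", params)
cnxn.commit()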

The following code does just that. When I tested it using a file with 5000 rows, running it with rows_per_batch=1000 (the maximum) was about 100 times faster than with rows_per_batch=1 (the equivalent of your current approach).

import numpy
import pandas as pd
import pyodbc
import time


class MyDfInsert:
    def __init__(self, cnxn, sql_stub, data_frame, rows_per_batch=1000):
        # NB: hard limit is 1000 for SQL Server table value constructor
        self._rows_per_batch = 1000 if rows_per_batch > 1000 else rows_per_batch

        self._cnxn = cnxn
        self._sql_stub = sql_stub
        self._num_columns = None
        self._row_placeholders = None
        self._num_rows_previous = None
        self._all_placeholders = None
        self._sql = None

        row_count = 0
        param_list = list()
        for df_row in data_frame.itertuples():
            param_list.append(tuple(df_row[1:]))  # omit zero-based row index
            row_count += 1
            if row_count >= self._rows_per_batch:
                self._send_insert(param_list)  # send a full batch
                row_count = 0
                param_list = list()
        self._send_insert(param_list)  # send any remaining rows

    def _send_insert(self, param_list):
        if len(param_list) > 0:
            if self._num_columns is None:
                # print('[DEBUG] (building items that depend on the number of columns ...)')
                # this only happens once
                self._num_columns = len(param_list[0])
                self._row_placeholders = ','.join(['?' for x in range(self._num_columns)])
                # e.g. '?,?'
            num_rows = len(param_list)
            if num_rows != self._num_rows_previous:
                # print('[DEBUG] (building items that depend on the number of rows ...)')
                self._all_placeholders = '({})'.format('),('.join([self._row_placeholders for x in range(num_rows)]))
                # e.g. '(?,?),(?,?),(?,?)'
                self._sql = f'{self._sql_stub} VALUES {self._all_placeholders}'
                self._num_rows_previous = num_rows
            params = [int(element) if isinstance(element, numpy.int64) else element
                      for row_tup in param_list for element in row_tup]
            # print('[DEBUG]    sql: ' + repr(self._sql))
            # print('[DEBUG] params: ' + repr(params))
            crsr = self._cnxn.cursor()
            crsr.execute(self._sql, params)


if __name__ == '__main__':
    conn_str = (
        'DRIVER=ODBC Driver 11 for SQL Server;'
        'SERVER=192.168.1.134,49242;'
        'Trusted_Connection=yes;'
    )
    cnxn = pyodbc.connect(conn_str, autocommit=True)
    crsr = cnxn.cursor()
    crsr.execute("CREATE TABLE #tmp (id INT PRIMARY KEY, txt NVARCHAR(50))")

    df = pd.read_csv(r'C:\Users\Gord\Desktop\Query1.txt', sep=';', header=0)

    t0 = time.time()

    MyDfInsert(cnxn, "INSERT INTO #tmp (id, txt)", df, rows_per_batch=1000)

    print()
    print(f'Inserts completed in {time.time() - t0:.2f} seconds.')

    cnxn.close()
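Applied to the table from the question, the call would look roughly like this (assuming the DataFrame's column order matches Col1, Col2 in the table; adjust the column list to your schema):

# hypothetical call against the question's table; the DataFrame column order
# must match the column list in the INSERT stub
MyDfInsert(cnxn, "INSERT INTO MySchema.MyTable (Col1, Col2)", df, rows_per_batch=1000)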
