
Bulk insert into vertica using Python

I am using Python to transfer data (~8 million rows) from Oracle to Vertica. I wrote a Python script that transfers the data in 2 hours, but I am looking for ways to increase the transfer speed.

The process I am using:

  • Connect to Oracle
  • Pull the data into a dataframe (pandas)
  • Iterate over the rows in the dataframe one by one and insert them into Vertica (cursor.execute). I wanted to use the dataframe.to_sql method, but it is limited to only a couple of databases

Has anybody used a better way (bulk inserts or any other method?) to insert data into Vertica using Python?

Here is the code snippet:

df = pandas.read_sql_query(sql,conn)
conn_vertica = pyodbc.connect("DSN=dsnname")
cursor = conn_vertica.cursor()

for i,row in df.iterrows():
    cursor.execute("insert into <tablename> values (?,?,?,?,?,?,?,?,?)", *row.values)

cursor.close()
conn_vertica.commit()
conn_vertica.close()
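Before switching to COPY, one incremental improvement over the row-by-row loop above is to pass batches of tuples to pyodbc's cursor.executemany instead of calling execute once per row. A minimal sketch of the batching step only (the executemany call is left as a comment because it needs a live connection; the table name and batch size are illustrative):

```python
import pandas as pd

def row_batches(df, batch_size):
    """Yield the DataFrame's rows as lists of plain tuples, batch_size rows at a time."""
    rows = list(df.itertuples(index=False, name=None))
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]

# Illustrative 10-row frame split into batches of 4 -> sizes 4, 4, 2.
df = pd.DataFrame({"a": range(10), "b": range(10)})
batches = list(row_batches(df, 4))

# Each batch could then be sent in one round trip:
# cursor.executemany("insert into <tablename> values (?,?)", batch)
```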

From the vertica-python code: https://github.com/uber/vertica-python/blob/master/vertica_python/vertica/cursor.py

with open("/tmp/file.csv", "rb") as fs:
    cursor.copy("COPY table(field1,field2) FROM STDIN DELIMITER ',' ENCLOSED BY '\"'",
                fs, buffer_size=65536)

Doing single row inserts into Vertica is very inefficient. You need to load in batches.

The way we do it is using the COPY command, here is an example:

COPY mytable (firstcolumn, secondcolumn) FROM STDIN DELIMITER ',' ENCLOSED BY '"';

Have you considered using an existing library, for example vertica-python?

Check out this link to Vertica's docs for more info on COPY options

In case you want to load a DataFrame instead of a CSV file into a Vertica table, you can use this approach:

from vertica_python import connect

db_connection = connect(host='hostname',
                        port=5433,
                        user='user',
                        password='password',
                        database='db_name',
                        unicode_error='replace')

cursor = db_connection.cursor()

cursor.copy("COPY table_name (field1, field2, ...) FROM STDIN DELIMITER ','",
            df.to_csv(header=None, index=False))

The part below is what makes the difference: it converts the in-memory DataFrame into comma-separated lines of text that the COPY command can read:

df.to_csv(header=None, index=False)
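To see concretely what that call produces, here is a tiny self-contained example (the column names and values are made up):

```python
import pandas as pd

# A small frame standing in for the query result.
df = pd.DataFrame({"field1": [1, 2], "field2": ["a", "b"]})

# header=None suppresses the header row; index=False drops the index column,
# leaving exactly the comma-separated rows that COPY ... FROM STDIN expects.
csv_text = df.to_csv(header=None, index=False)
print(csv_text)
# 1,a
# 2,b
```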

It works very fast.
