I am using Python to transfer data (~8 million rows) from Oracle to Vertica. I wrote a Python script that transfers the data in 2 hours, but I am looking for ways to increase the transfer speed.
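Process I am using: read the Oracle query result into a pandas DataFrame, then insert the rows into Vertica one at a time with cursor.execute.
There is the dataframe.to_sql method, but it is limited to only a couple of databases. Has anybody used a better way (bulk inserts or any other method) to insert data into Vertica using Python?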
Here is the code snippet:
import pandas
import pyodbc

# sql (the Oracle query) and conn (the Oracle connection) are defined elsewhere
df = pandas.read_sql_query(sql, conn)
conn_vertica = pyodbc.connect("DSN=dsnname")
cursor = conn_vertica.cursor()
# one INSERT round trip per row; this loop is the slow part
for i, row in df.iterrows():
    cursor.execute("insert into <tablename> values(?,?,?,?,?,?,?,?,?)", *row.values[:9])
cursor.close()
conn_vertica.commit()
conn_vertica.close()
From the vertica-python source code (https://github.com/uber/vertica-python/blob/master/vertica_python/vertica/cursor.py):
with open("/tmp/file.csv", "rb") as fs:
    cursor.copy("COPY table(field1,field2) FROM STDIN DELIMITER ',' ENCLOSED BY '\"'", fs, buffer_size=65536)
Doing single row inserts into Vertica is very inefficient. You need to load in batches.
The way we do it is with the COPY command. Here is an example:
COPY mytable (firstcolumn, secondcolumn) FROM STDIN DELIMITER ',' ENCLOSED BY '"';
Have you considered using an existing library, for example vertica-python?
See Vertica's documentation for more information on COPY options.
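A minimal sketch of the batch approach described above, assuming the rows are already in memory as Python tuples. The helper names (build_copy_sql, rows_to_csv, copy_rows) are hypothetical, and cursor is assumed to be a vertica-python cursor whose copy method accepts a string as its data argument. The CSV writer backslash-escapes embedded quotes on the assumption that the load uses Vertica's default escape character, so check that against your own COPY options.

```python
import csv
import io


def build_copy_sql(table, columns):
    """Build a COPY ... FROM STDIN statement.

    table and columns are interpolated directly, so they must be
    trusted identifiers, never user input.
    """
    column_list = ", ".join(columns)
    return (
        f"COPY {table} ({column_list}) FROM STDIN "
        "DELIMITER ',' ENCLOSED BY '\"'"
    )


def rows_to_csv(rows):
    """Serialize an iterable of row tuples into one CSV string."""
    buffer = io.StringIO()
    writer = csv.writer(
        buffer,
        delimiter=",",
        quotechar='"',
        quoting=csv.QUOTE_MINIMAL,  # quote only fields that need it
        doublequote=False,          # assumption: escape quotes with a backslash
        escapechar="\\",
        lineterminator="\n",
    )
    writer.writerows(rows)
    return buffer.getvalue()


def copy_rows(cursor, table, columns, rows):
    """Send all rows in a single COPY instead of one INSERT per row."""
    cursor.copy(build_copy_sql(table, columns), rows_to_csv(rows))
```

With a real connection this would be called as copy_rows(cursor, "mytable", ["firstcolumn", "secondcolumn"], rows), followed by a commit on the connection.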
If you want to load a DataFrame rather than a CSV file into a Vertica table, you can use this:
from vertica_python import connect

# df is the pandas DataFrame to load
db_connection = connect(
    host='hostname',
    port=5433,
    user='user',
    password='password',
    database='db_name',
    unicode_error='replace',
)
cursor = db_connection.cursor()
cursor.copy(
    "COPY table_name (field1, field2, ...) from stdin DELIMITER ','",
    df.to_csv(header=None, index=False),
)
The part below is what makes the difference: it converts the in-memory DataFrame into comma-separated lines of text that the COPY command can read:
df.to_csv(header=None, index=False)
It works very fast.
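For a table of several million rows it may be preferable not to hold the whole DataFrame and its CSV text in memory at once. The sketch below applies the same to_csv plus COPY idea one chunk at a time. The function name copy_in_chunks is hypothetical, connection is assumed to be a vertica-python connection, and chunks is assumed to be any iterable of DataFrames. The explicit commit assumes autocommit is off.

```python
def copy_in_chunks(connection, copy_sql, chunks):
    """COPY each DataFrame chunk into Vertica and return the rows sent.

    chunks: an iterable of pandas DataFrames (or anything exposing
    to_csv(header=..., index=...) and len()).
    """
    cursor = connection.cursor()
    total_rows = 0
    for chunk in chunks:
        if len(chunk) == 0:
            continue  # nothing to send for an empty chunk
        # Same conversion as above: DataFrame -> headerless CSV text
        cursor.copy(copy_sql, chunk.to_csv(header=False, index=False))
        total_rows += len(chunk)
    connection.commit()  # assumption: autocommit is not enabled
    cursor.close()
    return total_rows
```

One way to produce the chunks is pandas.read_sql_query(sql, conn, chunksize=100000), which returns an iterator of DataFrames instead of a single one. The chunk size here is an arbitrary starting point, not a tuned value.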