
Bulk insert into Vertica using Python

I am using Python to transfer data (~8 million rows) from Oracle to Vertica. I wrote a Python script that transfers the data in 2 hours, but I am looking for ways to increase the transfer speed.

The process I am using:

  • Connect to Oracle
  • Pull the data into a dataframe (pandas)
  • Iterate over the rows in the dataframe one by one and insert each into Vertica (cursor.execute). I wanted to use the dataframe.to_sql method, but it is limited to only a couple of databases.

Has anybody used a better way (bulk inserts or any other method) to insert data into Vertica using Python?

Here is the code snippet:

df = pandas.read_sql_query(sql,conn)
conn_vertica = pyodbc.connect("DSN=dsnname")
cursor = conn_vertica.cursor()

for i,row in df.iterrows():
    cursor.execute("insert into <tablename> values(?,?,?,?,?,?,?,?,?)",row.values[0],row.values[1],row.values[2],row.values[3],row.values[4],row.values[5],row.values[6],row.values[7],row.values[8])

cursor.close()
conn_vertica.commit()
conn_vertica.close()

From the vertica-python code: https://github.com/uber/vertica-python/blob/master/vertica_python/vertica/cursor.py

with open("/tmp/file.csv", "rb") as fs:
    cursor.copy("COPY table(field1,field2) FROM STDIN DELIMITER ',' ENCLOSED BY '\"'", fs, buffer_size=65536)

Doing single-row inserts into Vertica is very inefficient. You need to load in batches.
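As a minimal sketch of what batching can look like while still using INSERT statements, the helper below groups rows and hands each group to `cursor.executemany`, which is part of the Python DB-API and is available on pyodbc cursors. The name `insert_in_batches` and the batch size are assumptions for illustration, and how much this helps depends on the ODBC driver.

```python
from itertools import islice


def insert_in_batches(cursor, insert_sql, rows, batch_size=10_000):
    """Send rows to the database in batches via cursor.executemany.

    `cursor` is any DB-API cursor (for example a pyodbc cursor) and
    `rows` is any iterable of parameter tuples.
    Returns the number of rows sent.
    """
    iterator = iter(rows)
    total = 0
    while True:
        # Take up to batch_size rows without materializing the whole input.
        batch = list(islice(iterator, batch_size))
        if not batch:
            break
        cursor.executemany(insert_sql, batch)
        total += len(batch)
    return total
```

With the dataframe from the question, the rows could be supplied as `df.itertuples(index=False, name=None)`, which yields one plain tuple per row.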

We do this with the COPY command. Here is an example:

COPY mytable (firstcolumn, secondcolumn) FROM STDIN DELIMITER ',' ENCLOSED BY '"';

Have you considered using an existing library, for example vertica-python?

See Vertica's documentation for more information on COPY options.

If you want to load a dataframe instead of a CSV file into a Vertica table, you can use this:

from vertica_python import connect

db_connection = connect(
    host='hostname',
    port=5433,
    user='user',
    password='password',
    database='db_name',
    unicode_error='replace',
)

cursor = db_connection.cursor()

cursor.copy(
    "COPY table_name (field1, field2, ...) from stdin DELIMITER ','",
    df.to_csv(header=None, index=False),
)

The part below is what makes the difference: it converts the in-memory dataframe into comma-separated lines of text that the COPY command can read:

df.to_csv(header=None, index=False)

It is very fast.
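For a frame with millions of rows, `df.to_csv` builds the whole CSV as one string in memory. The sketch below is one possible way to send the data in chunks instead, assuming `cursor.copy(sql, data)` accepts a string as in the snippet above. The helper names `rows_to_csv` and `copy_in_chunks` and the chunk size are hypothetical, not part of vertica-python.

```python
import csv
import io
from itertools import islice


def rows_to_csv(rows):
    """Serialize an iterable of row tuples to comma-separated text."""
    buffer = io.StringIO()
    # csv.writer wraps fields that contain a comma in double quotes.
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()


def copy_in_chunks(cursor, copy_sql, rows, chunk_size=100_000):
    """Run one COPY ... FROM STDIN per chunk of rows; return rows sent."""
    iterator = iter(rows)
    total = 0
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            break
        # Assumption: cursor.copy takes the statement and the data as text.
        cursor.copy(copy_sql, rows_to_csv(chunk))
        total += len(chunk)
    return total
```

A call might look like `copy_in_chunks(cursor, "COPY table_name (field1, field2) FROM STDIN DELIMITER ',' ENCLOSED BY '\"'", df.itertuples(index=False, name=None))`. Because the writer quotes fields containing commas, the COPY statement would need an `ENCLOSED BY '"'` clause like the one in the earlier example.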
