Bulk insert into Vertica using Python using Uber's vertica-python package

Question 1 of 2

I'm trying to import data from a CSV file into Vertica with Python, using Uber's vertica-python package. The problem is that whitespace-only data elements are being loaded into Vertica as NULLs; I want only empty data elements to be loaded as NULLs, and non-empty whitespace data elements to be loaded as whitespace instead.

For example, the following two rows of a CSV file are both loaded into the database as ('1','abc',NULL,NULL), whereas I want the second one to be loaded as ('1','abc',' ',NULL).

1,abc,,^M
1,abc,  ,^M

Here is the code:

import csv

# import vertica-python package by Uber
#   source: https://github.com/uber/vertica-python
import vertica_python

# write CSV file
filename = 'temp.csv'
data = <list of lists, e.g. [[1,'abc',None,'def'],[2,'b','c','d']]>
with open(filename, 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f, escapechar='\\', doublequote=False)
    writer.writerows(data)

# define query
q = "copy <table_name> (<column_names>) from stdin "\
    "delimiter ',' "\
    "enclosed by '\"' "\
    "record terminator E'\\r' "

# copy data
conn = vertica_python.connect( host=<host>,
                               port=<port>,
                               user=<user>,
                               password=<password>,
                               database=<database>,
                               charset='utf8' )
cur = conn.cursor()
with open(filename, 'rb') as f:
    cur.copy(q,  f)
conn.close()
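
To see what the writer settings above produce for the two sample rows, here is a minimal sketch. It assumes the same csv.writer options as in the question, but writes to an in-memory buffer instead of temp.csv so that it runs without a database.

```python
import csv
import io

# Same writer options as in the question, but writing to memory.
buf = io.StringIO(newline='')
writer = csv.writer(buf, escapechar='\\', doublequote=False)
writer.writerows([
    [1, 'abc', None, None],  # None is written as an empty field
    [1, 'abc', '  ', None],  # whitespace is written as-is, unquoted
])
csv_text = buf.getvalue()
print(repr(csv_text))
```

The two rows do differ in the file, so the distinction appears to be lost at load time rather than at write time. Each row ends in \r\n, the default line terminator of csv.writer, which would explain the ^M shown in the sample rows.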

Question 2 of 2

Are there any other issues (e.g. character encoding) I have to watch out for when using this method of loading data into Vertica? Are there any other mistakes in the code? I'm not 100% convinced it will work on all platforms (it currently runs on Linux; there may be record terminator issues on other platforms, for example). Any recommendations to make this code more robust would be greatly appreciated.

In addition, are there alternative methods of bulk inserting data into Vertica from Python, such as loading objects directly from Python instead of writing them to CSV files first, without sacrificing speed? The data volume is large, and the insert job as is takes a couple of hours to run.

Thank you in advance for any help you can provide!

The copy statement you have should perform the way you want with regard to the spaces. I tested it using a very similar COPY.

Edit: I missed what you were really asking with the copy. I'll leave this part in because it might still be useful for some people:

To fix the whitespace, you can change your copy statement:

copy <table_name> (FIELD1, FIELD2, MYFIELD3 AS FILLER VARCHAR(50), FIELD4, FIELD3 AS NVL(MYFIELD3,'') ) from stdin

By using FILLER, the copy parses that input field into something like a variable, which you can then assign to your actual table column using AS later in the copy.
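
As a rough illustration of the filler pattern, the sketch below builds such a statement from Python. The helper name build_filler_copy, the raw_ prefix and the table name my_table are made up for this example. Vertica's documentation writes the filler clause as `name FILLER type`, which is the form used here, and the identifiers are interpolated without quoting, so they are assumed to come from trusted code.

```python
def build_filler_copy(table, columns, filler_column, filler_type="VARCHAR(50)"):
    """Build a COPY ... FROM STDIN statement that routes one column through a FILLER."""
    filler_name = "raw_" + filler_column  # hypothetical naming convention
    parts = []
    for col in columns:
        if col == filler_column:
            # Parse this input field into a filler instead of the table column.
            parts.append(f"{filler_name} FILLER {filler_type}")
        else:
            parts.append(col)
    # Derived column: assign the parsed filler value to the real table column.
    parts.append(f"{filler_column} AS NVL({filler_name}, '')")
    return (
        f"COPY {table} ({', '.join(parts)}) FROM STDIN "
        "DELIMITER ',' ENCLOSED BY '\"'"
    )


query = build_filler_copy(
    "my_table", ["FIELD1", "FIELD2", "FIELD3", "FIELD4"], "FIELD3"
)
print(query)
```

The resulting string can be passed to cur.copy() in place of the original query. The expression after AS is where any other transformation of the raw field would go.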

As for any gotchas... I often do what you have on Solaris. The only thing I noticed is that you are setting the record terminator; I'm not sure whether that is really something you need to do depending on the environment. I've never had to do it when switching between Linux, Windows and Solaris.

Also, one hint: the copy returns a result set that tells you how many rows were loaded. Do a fetchone() and print it out and you'll see it.

The only other thing I can recommend is to use reject tables in case any rows are rejected.
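
One way to do that is Vertica's REJECTED DATA AS TABLE clause on the COPY statement. The sketch below only builds the SQL strings. The helper with_reject_table and the table names are invented for this example, and the columns in the follow-up query are the ones I would expect in a rejects table, so check them against your Vertica version.

```python
def with_reject_table(copy_query, reject_table):
    """Append a REJECTED DATA AS TABLE clause to an existing COPY statement."""
    base = copy_query.rstrip().rstrip(';')
    return f"{base} REJECTED DATA AS TABLE {reject_table}"


copy_query = with_reject_table(
    "COPY my_table (FIELD1, FIELD2) FROM STDIN DELIMITER ','",
    "my_table_rejects",  # hypothetical table name
)
# After the load, rejected rows can be inspected with an ordinary query.
inspect_query = (
    "SELECT row_number, rejected_reason, rejected_data "
    "FROM my_table_rejects"
)
print(copy_query)
```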

You mentioned that it is a large job. You may need to increase your read timeout by adding 'read_timeout': 7200, (or more) to your connection. I'm not sure whether None would disable the read timeout.
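
In dictionary form, that could look like the sketch below. The host, credentials and database are placeholders, and whether read_timeout is honoured depends on the vertica-python version installed, so treat the key as an assumption to verify.

```python
# Placeholder connection settings; replace with your own values.
conn_info = {
    'host': 'vertica.example.com',
    'port': 5433,
    'user': 'dbadmin',
    'password': 'secret',
    'database': 'mydb',
    # Two hours, in seconds, for long-running loads (assumed option name,
    # taken from the suggestion above).
    'read_timeout': 7200,
}

# With the package installed, the dictionary is unpacked into connect():
# conn = vertica_python.connect(**conn_info)
```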

As for a faster way... if the file is accessible directly on the Vertica node itself, you could reference the file directly in the copy instead of doing a copy from stdin, and have the daemon load it directly. It's much faster and opens up a number of optimizations. You could then use apportioned load, and if you have multiple files to load you can reference them all together in a list of files.
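
A sketch of what a file-based copy with several files might look like when built from Python is shown here. The helper build_file_copy, the table name and the paths are hypothetical, and the paths must be readable on the Vertica node itself, not on the client.

```python
def build_file_copy(table, paths, delimiter=','):
    """Build a COPY statement that loads one or more server-side files."""
    # Single-quote each path, doubling any embedded single quotes.
    quoted = ", ".join("'" + p.replace("'", "''") + "'" for p in paths)
    return f"COPY {table} FROM {quoted} DELIMITER '{delimiter}' ENCLOSED BY '\"'"


file_query = build_file_copy(
    "my_table",
    ["/data/part1.csv", "/data/part2.csv"],  # hypothetical node-local paths
)
print(file_query)
```

A statement like this would be run with cur.execute() rather than cur.copy(), since no data is streamed from the client.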

It's kind of a long topic, though. If you have any specific questions, let me know.
