
Importing CSV with psql \copy, modifying data as it comes

I frequently need to import a CSV into Postgres and usually use the \copy command from psql. It usually looks something like this:

\copy tbl FROM import.csv CSV 

I have two common problems which I feel might have a similar answer.

  1. parsing date strings into a TIMESTAMP field as they come in
  2. empty strings in INTEGER fields causing errors

In both cases only a minor modification is needed, but my current solution is to create a loading table with every field typed VARCHAR, then create another table with the correct schema. I then use \copy and:

CREATE TABLE loading_tbl (
    datefield VARCHAR,
    integerfield VARCHAR
);    

CREATE TABLE tbl (
    datefield TIMESTAMP,
    integerfield INTEGER
);

\copy loading_tbl FROM import.csv CSV

INSERT INTO tbl (datefield, integerfield)
SELECT
    to_timestamp(datefield, 'YYYY-Mon, DAY HH24:MI a.m'),
    integerfield::INTEGER
FROM loading_tbl;

DROP TABLE loading_tbl;

Is this the best method, or is there a simpler way? It is kind of a pain to create two tables, especially as the number of fields increases.

Another option would be to use a scripting language to do the ETL. It may be easier to reason about and/or have less overhead, depending on your exact needs.

For example, you could use Python's csv and psycopg2 modules to interact with the CSV file and the Postgres database, respectively, performing whatever ETL is necessary. psycopg2 will in general handle the conversion from a timestamp string to an actual Postgres timestamp for you (assuming it's a recognized timestamp string, of which there are a variety of formats).

For fields which are integers in Postgres but arrive as empty strings in the CSV, the Python script can check for empty-string values and insert NULL into Postgres instead.

I have used Python to do something like this very recently with good results. The biggest win over the solution in the question is probably not needing transitional tables, since the ETL can be done in the script and the data then sent into Postgres via psycopg2.
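A minimal sketch of that approach, assuming the two-column layout from the question's example schema; the timestamp format, table name `tbl`, and connection string are all placeholders you would adjust to your data:

```python
import csv
from datetime import datetime


def clean_row(row):
    """Convert one CSV row (date string, integer string) for Postgres.

    The timestamp format below is an assumption -- adjust it to match
    your data. Empty integer strings become None, which psycopg2 sends
    to Postgres as NULL instead of raising a cast error.
    """
    datefield, integerfield = row
    ts = datetime.strptime(datefield, "%Y-%m-%d %H:%M:%S")
    num = int(integerfield) if integerfield.strip() else None
    return ts, num


def load_csv(path, dsn):
    """Stream the CSV into Postgres (requires psycopg2)."""
    import psycopg2  # imported here so clean_row() is testable without it

    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        with open(path, newline="") as f:
            for row in csv.reader(f):
                # psycopg2 adapts datetime -> timestamp and None -> NULL.
                cur.execute(
                    "INSERT INTO tbl (datefield, integerfield) VALUES (%s, %s)",
                    clean_row(row),
                )
```

Keeping the cleaning logic in its own function makes it easy to unit-test the ETL rules separately from the database round trip.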

If your ETL needs are modest, i.e. limited to the example you provided above, it may be worth sticking with pure SQL. One enhancement would be to use a temp table (for loading_tbl) instead of a regular table. That way you wouldn't need to worry about dropping it after ETL'ing the data.
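Concretely, the loading step from the question could then look like this (column names follow the question's example; the NULLIF guard is an addition that covers quoted empty strings in the CSV):

```sql
-- Temp-table version of the question's loading table: it is dropped
-- automatically at the end of the session, so no explicit DROP is needed.
CREATE TEMP TABLE loading_tbl (
    datefield VARCHAR,
    integerfield VARCHAR
);

\copy loading_tbl FROM import.csv CSV

INSERT INTO tbl (datefield, integerfield)
SELECT
    to_timestamp(datefield, 'YYYY-Mon, DAY HH24:MI a.m'),
    NULLIF(integerfield, '')::INTEGER  -- empty strings become NULL
FROM loading_tbl;
```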
