
Importing a CSV with psql \copy, modifying data as it comes in

I frequently need to import a CSV into Postgres and usually use the \copy command from psql. It usually looks something like this:

\copy tbl FROM import.csv CSV 

I have two common problems which I feel might have a similar answer.

  1. parsing date strings into a TIMESTAMP field as they come in
  2. empty strings in INTEGER fields causing errors

In both cases only a minor modification needs to be done, but my current solution is to create a loading table with all fields as type VARCHAR, then create another table with the correct schema. I then use \copy and an INSERT ... SELECT:

CREATE TABLE loading_tbl (
    datefield VARCHAR,
    integerfield VARCHAR
);    

CREATE TABLE tbl (
    datefield TIMESTAMP,
    integerfield INTEGER
);

\copy loading_tbl FROM import.csv CSV

INSERT INTO tbl (datefield, integerfield)
SELECT
    to_timestamp(datefield, 'YYYY-Mon, DAY HH24:MI a.m'),
    NULLIF(integerfield, '')::INTEGER  -- empty strings become NULL rather than failing the cast
FROM loading_tbl;

DROP TABLE loading_tbl;

Is this the best method, or is there a simpler way? It is kind of a pain to create two tables, especially as the number of fields increases.

Another option would be to use a scripting language to do the ETL. It may be easier to reason about and/or have less overhead, depending on your exact needs.

For example, you could use Python's csv and psycopg2 modules to interact with the CSV file and the Postgres database, respectively, performing any ETL that's necessary. psycopg2 will in general handle the conversion from a timestamp string to an actual Postgres timestamp for you (assuming it's a recognized timestamp format, of which there are a variety).

For fields that are integers in Postgres but appear as empty strings in the CSV, the Python script can check for empty-string values and insert NULL into Postgres instead.
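As an illustration, here is a minimal sketch of that approach (not a definitive script). It assumes import.csv has exactly two columns and no header row, and the connection settings are placeholders for your own.

# Minimal sketch: load import.csv into tbl with psycopg2, converting
# empty integer fields to NULL. Assumes two columns, no header row.
import csv
import psycopg2

conn = psycopg2.connect(dbname="mydb")  # placeholder connection settings
cur = conn.cursor()

with open("import.csv", newline="") as f:
    for datefield, integerfield in csv.reader(f):
        # Empty strings become None, which psycopg2 sends as NULL.
        integer_value = int(integerfield) if integerfield else None
        # Passing the date string as a parameter lets Postgres parse it on
        # insert; for an unusual format, parse it first with
        # datetime.strptime and pass a datetime object instead.
        cur.execute(
            "INSERT INTO tbl (datefield, integerfield) VALUES (%s, %s)",
            (datefield, integer_value),
        )

conn.commit()
cur.close()
conn.close()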

I have used Python to do something like this very recently with good results. The biggest win over the solution you have in the question is probably not needing transitional tables, since the ETL can be done in the script and the results sent into Postgres via psycopg2.

If your ETL needs are modest, i.e. limited to the example you provided above, it may be worth sticking with pure SQL. One enhancement would be to use a temp table (for loading_tbl) instead of a regular table. That way you wouldn't need to worry about dropping it after ETL'ing the data.
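To make that concrete, here is a sketch of the temp-table variant, reusing the columns from the question; NULLIF covers the empty-string case, and the temp table disappears when the session ends.

-- Temp table: no DROP TABLE needed, it is removed at the end of the session.
CREATE TEMP TABLE loading_tbl (
    datefield VARCHAR,
    integerfield VARCHAR
);

\copy loading_tbl FROM import.csv CSV

INSERT INTO tbl (datefield, integerfield)
SELECT
    to_timestamp(datefield, 'YYYY-Mon, DAY HH24:MI a.m'),
    NULLIF(integerfield, '')::INTEGER
FROM loading_tbl;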
