How to ignore errors with psql \copy meta-command

I am using psql with a PostgreSQL database and the following copy command:

\COPY isa (np1, np2, sentence) FROM 'c:\Downloads\isa.txt' WITH DELIMITER '|'

I get:

ERROR:  extra data after last expected column

How can I skip the lines with errors?

Up to and including Postgres 14, you cannot skip the errors without skipping the whole command. There is currently no more sophisticated error handling.

\copy is just a wrapper around SQL COPY that channels results through psql. The manual for COPY:

COPY stops operation at the first error. This should not lead to problems in the event of a COPY TO, but the target table will already have received earlier rows in a COPY FROM. These rows will not be visible or accessible, but they still occupy disk space. This might amount to a considerable amount of wasted disk space if the failure happened well into a large copy operation. You might wish to invoke VACUUM to recover the wasted space.

Bold emphasis mine. And:

COPY FROM will raise an error if any line of the input file contains more or fewer columns than are expected.

COPY is an extremely fast way to import / export data. Sophisticated checks and error handling would slow it down.

There was an attempt to add error logging to COPY in Postgres 9.0, but it was never committed.

Solution

Fix your input file instead.
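One way to fix the input file is to separate the rows COPY would accept from the rows it would reject before importing. The following is only a rough sketch: the function name split_by_field_count and the output file names are made up for illustration, and it counts fields by naively splitting on '|', so it assumes the data contains no escaped or quoted delimiters.

```shell
#!/bin/bash
# Hypothetical helper: split a pipe-delimited file into rows that have the
# expected number of fields and rows that have too many or too few.
# Assumption: '|' never appears escaped or quoted inside a field.
split_by_field_count() {
    local input_file="$1" expected="$2" good_file="$3" bad_file="$4"
    # Start from empty output files so they exist even if nothing is written.
    : > "$good_file"
    : > "$bad_file"
    awk -F'|' -v n="$expected" -v good="$good_file" -v bad="$bad_file" \
        '{ if (NF == n) print > good; else print > bad }' "$input_file"
}
```

For the table in the question this could be called as `split_by_field_count isa.txt 3 isa_good.txt isa_bad.txt`, after which isa_good.txt can be loaded with \copy and isa_bad.txt inspected by hand.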

If you have one or more additional columns in your input file and the file is otherwise consistent, you might add dummy columns to your table isa and drop those afterwards. Or (cleaner with production tables) import to a temporary staging table and INSERT selected columns (or expressions) to your target table isa from there.
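The staging-table route might look roughly like the sketch below. It assumes every input row has exactly one extra trailing column; the staging table name isa_staging, the column name extra, and the text column types are assumptions for illustration, not taken from the question. The function only prints the psql script, so it can be reviewed before being run.

```shell
#!/bin/bash
# Hypothetical helper: print a psql script that loads the file into a
# temporary staging table with one extra column, then copies only the
# wanted columns into the real table isa.
staging_import_sql() {
    local input_file="$1"
    cat <<SQL
CREATE TEMP TABLE isa_staging (np1 text, np2 text, sentence text, extra text);
\copy isa_staging FROM '$input_file' WITH DELIMITER '|'
INSERT INTO isa (np1, np2, sentence) SELECT np1, np2, sentence FROM isa_staging;
SQL
}
```

The output is meant to be piped into a single psql session, for example `staging_import_sql /path/to/isa.txt | psql dbname`, because the temporary table only exists for the session that created it.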

Related answers with detailed instructions:

It is too bad that after 25 years Postgres still has no -ignore-errors flag or option for the COPY command. In this era of big data you get a lot of dirty records, and it can be very costly for a project to fix every outlier.

I had to make a workaround this way:

  1. Copy the original table and call it dummy_original_table.
  2. On the dummy table, create a BEFORE INSERT trigger like this:
    CREATE OR REPLACE FUNCTION on_insert_in_original_table() RETURNS trigger AS $$
    DECLARE
        v_rec   RECORD;
    BEGIN
        -- Return NULL on duplicates so no 'duplicate key' error is raised.
        SELECT * INTO v_rec FROM original_table WHERE primary_key = NEW.primary_key;
        IF FOUND THEN
            RETURN NULL;
        END IF;
        BEGIN
            INSERT INTO original_table(datum, primary_key) VALUES (NEW.datum, NEW.primary_key)
                ON CONFLICT DO NOTHING;
        EXCEPTION
            WHEN OTHERS THEN
                NULL;  -- swallow any other per-row error and carry on
        END;
        RETURN NULL;  -- never store the row in the dummy table itself
    END;
    $$ LANGUAGE plpgsql;

    -- The original post did not show this statement; the trigger name is illustrative.
    -- Before Postgres 11, write EXECUTE PROCEDURE instead of EXECUTE FUNCTION.
    CREATE TRIGGER redirect_to_original_table
        BEFORE INSERT ON dummy_original_table
        FOR EACH ROW EXECUTE FUNCTION on_insert_in_original_table();
  3. Run a copy into the dummy table. No record will be inserted there, but all of them will be inserted in the original_table.

psql dbname -c "\copy dummy_original_table(datum,primary_key) FROM '/home/user/data.csv' DELIMITER E'\t'"

Here's one solution: import the batch file one line at a time. The performance can be much slower, but it may be sufficient for your scenario:

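#!/bin/bash

input_file=./my_input.csv
tmp_file=/tmp/one-line.csv
# Read the file directly (no cat) and keep each line exactly as written.
while IFS= read -r input_line; do
    printf '%s\n' "$input_line" > "$tmp_file"
    # \copy reads the file on the client side; with -c it must fit on one line.
    # Redirecting stdin keeps psql from consuming the loop's input.
    psql my_database \
        -c "\copy my_table FROM '$tmp_file' DELIMITER '|' CSV" < /dev/null
done < "$input_file"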

Additionally, you could modify the script to capture the psql stdout/stderr and exit status and, if the exit status is non-zero, echo $input_line and the captured stdout/stderr to stdout and/or append them to a file.
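A possible shape for that modification is sketched here. The function name import_lines, the reject file, and the PSQL_CMD variable (a hook for substituting another command while trying the loop out) are all assumptions for illustration; my_database and my_table are the placeholder names from the script above.

```shell
#!/bin/bash
# Hypothetical variant of the line-by-line import that records rejected lines.
# PSQL_CMD is an illustrative override; it defaults to the real psql.
import_lines() {
    local input_file="$1" reject_file="$2" tmp_file err line
    tmp_file=$(mktemp)
    : > "$reject_file"
    while IFS= read -r line || [ -n "$line" ]; do
        printf '%s\n' "$line" > "$tmp_file"
        # Capture stderr only; stdin is redirected so the loop input is left alone.
        if ! err=$("${PSQL_CMD:-psql}" my_database -v ON_ERROR_STOP=1 \
                -c "\copy my_table FROM '$tmp_file' DELIMITER '|' CSV" \
                2>&1 >/dev/null </dev/null); then
            # Keep the bad line for later repair and report why it failed.
            printf '%s\n' "$line" >> "$reject_file"
            printf 'rejected: %s\n%s\n' "$line" "$err" >&2
        fi
    done < "$input_file"
    rm -f "$tmp_file"
}
```

Called as `import_lines ./my_input.csv ./rejects.csv`, this leaves the failed lines in rejects.csv, so they can be corrected and imported in a second pass.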

Workaround: remove the reported errant line using sed and run \copy again

Later versions of Postgres (including Postgres 13) will report the line number of the error. You can then remove that line with sed and run \copy again, e.g.:

#!/bin/bash
bad_line_number=5  # assuming line 5 is the bad line
sed "${bad_line_number}d" input.csv > filtered.csv
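Instead of typing the line number by hand, it could be pulled out of the psql error output. This is a sketch that assumes the error context is printed in the English form `CONTEXT:  COPY isa, line 5: "..."`; the helper name copy_error_line is made up, and a different locale or message layout would need a different pattern.

```shell
#!/bin/bash
# Hypothetical helper: read psql error output on stdin and print the line
# number from the first "CONTEXT:  COPY <table>, line N" entry, if any.
copy_error_line() {
    sed -n 's/^CONTEXT:  *COPY [^,]*, line \([0-9][0-9]*\).*/\1/p' | head -n 1
}

# Possible use (not run here):
#   bad_line_number=$(psql dbname -c "\copy isa FROM 'input.csv' WITH DELIMITER '|'" 2>&1 | copy_error_line)
#   [ -n "$bad_line_number" ] && sed "${bad_line_number}d" input.csv > filtered.csv
```

Because COPY stops at the first error, each run reveals at most one bad line, so the remove-and-retry step has to be repeated until the import succeeds.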

[per the comment from @Botond_Balázs]
