Check if records exist in a Postgres table

I have to read a CSV file every 20 seconds. Each CSV contains between 500 and 60,000 lines. I have to insert the data into a Postgres table, but before that I need to check whether the items have already been inserted, because there is a high probability of getting duplicate items. The field checked for uniqueness is indexed.

So I read the file in chunks and use an IN clause to fetch the items that are already in the database.
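For reference, the chunked IN lookup described above might look roughly like the following. This is only a sketch: it uses Python's built-in sqlite3 module as a stand-in for Postgres so that it runs without a server, and the names `chunks`, `insert_new_ids`, and the single-column `tbl(tbl_id)` layout are assumptions made for illustration, not part of the original setup.

```python
import sqlite3
from itertools import islice


def chunks(items, size):
    """Yield successive lists of at most `size` items."""
    it = iter(items)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def insert_new_ids(conn, ids, chunk_size=500):
    """Insert only the ids missing from tbl, checking each chunk with IN."""
    inserted = 0
    for batch in chunks(ids, chunk_size):
        marks = ",".join("?" * len(batch))
        existing = {
            row[0]
            for row in conn.execute(
                f"SELECT tbl_id FROM tbl WHERE tbl_id IN ({marks})", batch
            )
        }
        # dict.fromkeys drops duplicates inside the batch, keeping order
        fresh = [(i,) for i in dict.fromkeys(batch) if i not in existing]
        conn.executemany("INSERT INTO tbl (tbl_id) VALUES (?)", fresh)
        inserted += len(fresh)
    conn.commit()
    return inserted


# Hypothetical data: ids 1..10 already stored, ids 6..15 arrive in the file.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tbl (tbl_id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO tbl VALUES (?)", [(i,) for i in range(1, 11)])
new_rows = insert_new_ids(conn, range(6, 16), chunk_size=4)
```

Every chunk costs one SELECT round trip plus one batch of inserts, which is the overhead the question is trying to avoid.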

Is there a better way of doing it?

This should perform well:

CREATE TEMP TABLE tmp AS SELECT * FROM tbl LIMIT 0; -- copy layout, but no data

COPY tmp FROM '/absolute/path/to/file' (FORMAT csv);

INSERT INTO tbl
SELECT tmp.*
FROM   tmp
LEFT   JOIN tbl USING (tbl_id)
WHERE  tbl.tbl_id IS NULL;

DROP TABLE tmp; -- else dropped at end of session automatically
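The same staging-table pattern can be tried end to end without a Postgres server. The sketch below is an assumption-laden stand-in: it uses Python's sqlite3 module, replaces COPY (which SQLite does not have) with `csv.reader` plus `executemany`, and invents a two-column `tbl(tbl_id, payload)` layout and a `load_new_rows` helper purely for illustration.

```python
import csv
import io
import sqlite3


def load_new_rows(conn, csv_file):
    """Stage CSV rows in a temp table, then insert only the unseen keys."""
    before = conn.execute("SELECT count(*) FROM tbl").fetchone()[0]
    # Copy the layout of tbl, but no data
    conn.execute("CREATE TEMP TABLE tmp AS SELECT * FROM tbl LIMIT 0")
    # Stand-in for COPY: bulk-load the file into the staging table
    rows = [(int(key), payload) for key, payload in csv.reader(csv_file)]
    conn.executemany("INSERT INTO tmp VALUES (?, ?)", rows)
    # Anti-join: keep staging rows that have no partner in tbl
    conn.execute("""
        INSERT INTO tbl
        SELECT tmp.*
        FROM   tmp
        LEFT   JOIN tbl USING (tbl_id)
        WHERE  tbl.tbl_id IS NULL
    """)
    conn.execute("DROP TABLE tmp")
    conn.commit()
    after = conn.execute("SELECT count(*) FROM tbl").fetchone()[0]
    return after - before


# Hypothetical data: keys 1 and 2 exist, the file brings 2, 3 and 4.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tbl (tbl_id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany("INSERT INTO tbl VALUES (?, ?)", [(1, "a"), (2, "b")])
added = load_new_rows(conn, io.StringIO("2,b\n3,c\n4,d\n"))
ids = [r[0] for r in conn.execute("SELECT tbl_id FROM tbl ORDER BY tbl_id")]
```

The whole file is checked with a single set-based statement instead of one lookup per chunk. If the file itself can contain the same key twice, the staged rows would need deduplicating first (for example with SELECT DISTINCT), because the anti-join only compares against rows already in `tbl`.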

Closely related to this answer.

First, just for completeness, I changed Erwin's code to use EXCEPT:

CREATE TEMP TABLE tmp AS SELECT * FROM tbl LIMIT 0; -- copy layout, but no data
COPY tmp FROM '/absolute/path/to/file' (FORMAT csv);

INSERT INTO tbl
SELECT tmp.*
FROM   tmp
EXCEPT
SELECT *
FROM   tbl;

DROP TABLE tmp;
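One caveat worth checking before swapping one form for the other: EXCEPT compares entire rows and removes duplicates from its result, while the LEFT JOIN form compares only the key column and keeps duplicates. With a single-column table the two select the same keys, but with more columns they can differ. The sketch below illustrates this with Python's sqlite3 module as a stand-in for Postgres; the `payload` column and the sample rows are assumptions made up for the demonstration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tbl (tbl_id INTEGER PRIMARY KEY, payload TEXT);
    CREATE TEMP TABLE tmp (tbl_id INTEGER, payload TEXT);
    INSERT INTO tbl VALUES (1, 'a'), (2, 'b');
    -- key 2 arrives with a changed payload; key 3 arrives twice
    INSERT INTO tmp VALUES (2, 'b2'), (3, 'c'), (3, 'c');
""")

# Whole-row comparison: (2, 'b2') differs from (2, 'b'), so it is kept,
# and the duplicate (3, 'c') is collapsed into one row.
except_rows = conn.execute(
    "SELECT * FROM tmp EXCEPT SELECT * FROM tbl ORDER BY 1"
).fetchall()

# Key-only comparison: key 2 is filtered out, both copies of key 3 remain.
anti_join_rows = conn.execute("""
    SELECT tmp.*
    FROM   tmp
    LEFT   JOIN tbl USING (tbl_id)
    WHERE  tbl.tbl_id IS NULL
""").fetchall()
```

Feeding either result into an INSERT on a table with a unique key would fail here: the EXCEPT result still contains the existing key 2, and the anti-join result contains key 3 twice.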

Then I resolved to test it myself. I tested on 9.1 with a mostly untouched postgresql.conf. The target table contains 10 million rows and the source table about 30 thousand, of which about 15 thousand already exist in the target table.

create table tbl (id integer primary key)
;
insert into tbl
select generate_series(1, 10000000)
;
create temp table tmp as select * from tbl limit 0
;
insert into tmp
select generate_series(9985000, 10015000)
;

I ran EXPLAIN on the SELECT part only. The EXCEPT version:

explain
select *
from tmp
except
select *
from tbl
;
                                       QUERY PLAN                                       
----------------------------------------------------------------------------------------
 HashSetOp Except  (cost=0.00..270098.68 rows=200 width=4)
   ->  Append  (cost=0.00..245018.94 rows=10031897 width=4)
         ->  Subquery Scan on "*SELECT* 1"  (cost=0.00..771.40 rows=31920 width=4)
               ->  Seq Scan on tmp  (cost=0.00..452.20 rows=31920 width=4)
         ->  Subquery Scan on "*SELECT* 2"  (cost=0.00..244247.54 rows=9999977 width=4)
               ->  Seq Scan on tbl  (cost=0.00..144247.77 rows=9999977 width=4)
(6 rows)

The outer join version:

explain
select *
from 
    tmp
    left join
    tbl using (id)
where tbl.id is null
;
                                QUERY PLAN                                
--------------------------------------------------------------------------
 Nested Loop Anti Join  (cost=0.00..208142.58 rows=15960 width=4)
   ->  Seq Scan on tmp  (cost=0.00..452.20 rows=31920 width=4)
   ->  Index Scan using tbl_pkey on tbl  (cost=0.00..7.80 rows=1 width=4)
         Index Cond: (tmp.id = id)
(4 rows)
