

Check if records exist in a Postgres table

I have to read a CSV file every 20 seconds. Each file contains between 500 and 60,000 lines. I have to insert the data into a Postgres table, but first I need to check whether the items have already been inserted, because there is a high probability of getting duplicate items. The field to check for uniqueness is indexed.

So, I read the file in chunks and use the IN clause to get the items already in the database.
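For reference, the chunked lookup described above might look roughly like the sketch below. The names tbl and tbl_id are assumptions (the question does not name the table or the indexed field), and the literal keys stand in for one chunk of values read from the CSV.

```sql
-- Hypothetical names: tbl is the target table, tbl_id the indexed field.
-- Returns the keys from this chunk that are already stored.
SELECT tbl_id
FROM   tbl
WHERE  tbl_id IN (101, 102, 103);  -- placeholder keys from one CSV chunk
```

Each chunk costs one round trip, and the client still has to filter the chunk against the returned keys before inserting.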

Is there a better way of doing it?

This should perform well:

CREATE TEMP TABLE tmp AS SELECT * FROM tbl LIMIT 0; -- copy layout, but no data

COPY tmp FROM '/absolute/path/to/file' (FORMAT csv); -- option list must be parenthesized

-- rows from the file that are not yet in tbl
SELECT tmp.*
FROM   tmp
LEFT   JOIN tbl USING (tbl_id)
WHERE  tbl.tbl_id IS NULL;

DROP TABLE tmp; -- else dropped at end of session automatically

Closely related to this answer.
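Since the goal in the question is to insert the missing rows, the same anti-join can feed an INSERT directly, run before tmp is dropped. This is only a sketch under assumptions: tbl_id is the uniqueness field as in the code above, and DISTINCT ON is added in case the CSV itself repeats a key, which the question does not rule out.

```sql
-- Sketch: insert only the rows whose key is not yet in tbl.
-- DISTINCT ON keeps one arbitrary row per key if the file repeats a key.
INSERT INTO tbl
SELECT DISTINCT ON (tmp.tbl_id) tmp.*
FROM   tmp
LEFT   JOIN tbl USING (tbl_id)
WHERE  tbl.tbl_id IS NULL
ORDER  BY tmp.tbl_id;

-- Alternative, assuming PostgreSQL 9.5 or later and a UNIQUE or
-- PRIMARY KEY constraint on tbl_id: let the server skip existing keys.
INSERT INTO tbl
SELECT * FROM tmp
ON     CONFLICT (tbl_id) DO NOTHING;
```

The two statements are alternatives; if both are run in sequence, the second simply finds nothing left to insert. ON CONFLICT is not available on older releases such as 9.1.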

First, just for completeness, I changed Erwin's code to use except:

CREATE TEMP TABLE tmp AS SELECT * FROM tbl LIMIT 0; -- copy layout, but no data
COPY tmp FROM '/absolute/path/to/file' (FORMAT csv);

-- rows in tmp that have no identical row in tbl
SELECT *
FROM   tmp
EXCEPT
SELECT *
FROM   tbl;


Then I resolved to test it myself. I tested it on 9.1 with a mostly untouched postgresql.conf. The target table contains 10 million rows and the origin table 30 thousand, 15 thousand of which already exist in the target table.

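-- target table: 10 million rows
create table tbl (id integer primary key);
insert into tbl
select generate_series(1, 10000000);

-- origin table: about 30 thousand rows, half of them already in tbl
create temp table tmp as select * from tbl limit 0;
insert into tmp
select generate_series(9985000, 10015000);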

I asked for the explain of the select part only. The except version:

explain
select *
from tmp
except
select *
from tbl;
                                       QUERY PLAN                                       
 HashSetOp Except  (cost=0.00..270098.68 rows=200 width=4)
   ->  Append  (cost=0.00..245018.94 rows=10031897 width=4)
         ->  Subquery Scan on "*SELECT* 1"  (cost=0.00..771.40 rows=31920 width=4)
               ->  Seq Scan on tmp  (cost=0.00..452.20 rows=31920 width=4)
         ->  Subquery Scan on "*SELECT* 2"  (cost=0.00..244247.54 rows=9999977 width=4)
               ->  Seq Scan on tbl  (cost=0.00..144247.77 rows=9999977 width=4)
(6 rows)

The outer join version:

explain
select *
from tmp
    left join
    tbl using (id)
where tbl.id is null;
                                QUERY PLAN                                
 Nested Loop Anti Join  (cost=0.00..208142.58 rows=15960 width=4)
   ->  Seq Scan on tmp  (cost=0.00..452.20 rows=31920 width=4)
   ->  Index Scan using tbl_pkey on tbl  (cost=0.00..7.80 rows=1 width=4)
         Index Cond: (tmp.id = id)
(4 rows)
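Both plans above are planner estimates only (total cost about 270099 for except versus about 208143 for the anti-join); no actual run times are shown. To compare measured times, something like the following could be run against the same test tables. This is a sketch, not part of the original test. The ANALYZE step is an addition: autovacuum does not process temp tables, so tmp has no collected statistics unless it is analyzed by hand.

```sql
-- Refresh statistics on the temp table (autovacuum skips temp tables).
ANALYZE tmp;

-- EXPLAIN ANALYZE executes the query and reports actual times.
EXPLAIN ANALYZE
SELECT * FROM tmp
EXCEPT
SELECT * FROM tbl;

EXPLAIN ANALYZE
SELECT tmp.*
FROM   tmp
LEFT   JOIN tbl USING (id)
WHERE  tbl.id IS NULL;
```

Note that the two queries are only equivalent here because the table has a single column: except compares whole rows and removes duplicates from its result, while the anti-join compares only the key and keeps every unmatched row.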
