Check if records exist in a Postgres table
I have to read a CSV file every 20 seconds. Each CSV contains between 500 and 60,000 lines. I have to insert the data into a Postgres table, but before that I need to check whether the items have already been inserted, because there is a high probability of getting duplicate items. The field to check for uniqueness is indexed.
So I read the file in chunks and use an IN clause to get the items that are already in the database.
Is there a better way of doing it?
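For reference, the chunked check described above might look roughly like the following sketch. The table name `items`, the column `item_key`, and the literal values are hypothetical, since the question does not name them.

```sql
-- Hypothetical names: items (target table), item_key (the indexed unique field).
-- One query per chunk; the literals stand in for the keys read from that chunk.
SELECT item_key
FROM   items
WHERE  item_key IN (1001, 1002, 1003);
-- Keys returned here are already stored; the rest of the chunk can be inserted.
```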
This should perform well:
CREATE TEMP TABLE tmp AS SELECT * FROM tbl LIMIT 0; -- copy layout, but no data
COPY tmp FROM '/absolute/path/to/file' (FORMAT csv);
INSERT INTO tbl
SELECT tmp.*
FROM tmp
LEFT JOIN tbl USING (tbl_id)
WHERE tbl.tbl_id IS NULL;
DROP TABLE tmp; -- else dropped at end of session automatically
Closely related to this answer.
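The same anti-join can also be written with NOT EXISTS. This is only a sketch of an equivalent formulation, not part of the original answer; it reuses the names `tmp`, `tbl`, and `tbl_id` from the statement above and assumes `tmp` has already been filled by COPY.

```sql
-- Same anti-join as above, written with NOT EXISTS instead of LEFT JOIN ... IS NULL.
-- tmp, tbl and tbl_id are the names used in the answer above.
INSERT INTO tbl
SELECT t.*
FROM   tmp t
WHERE  NOT EXISTS (
   SELECT 1
   FROM   tbl
   WHERE  tbl.tbl_id = t.tbl_id
);
```

Neither form removes duplicates that occur within the CSV file itself; both only filter out rows whose key is already present in tbl.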
First, just for completeness, I changed Erwin's code to use except:
CREATE TEMP TABLE tmp AS SELECT * FROM tbl LIMIT 0; -- copy layout, but no data
COPY tmp FROM '/absolute/path/to/file' (FORMAT csv);
INSERT INTO tbl
SELECT tmp.*
FROM tmp
except
select *
from tbl;
DROP TABLE tmp;
Then I resolved to test it myself. I tested it in 9.1 with a mostly untouched postgresql.conf. The target table contains 10 million rows and the origin table 30 thousand. 15 thousand of those already exist in the target table.
create table tbl (id integer primary key)
;
insert into tbl
select generate_series(1, 10000000)
;
create temp table tmp as select * from tbl limit 0
;
insert into tmp
select generate_series(9985000, 10015000)
;
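A quick sanity check of this setup, as a sketch: counting the rows of tmp that already have a match in tbl confirms the overlap. It uses only the `tbl`, `tmp`, and `id` names created above; the column alias is just illustrative.

```sql
-- Rows of tmp that already exist in tbl (the overlapping id range).
SELECT count(*) AS already_present
FROM   tmp
JOIN   tbl USING (id);
```

Because generate_series includes both bounds, tmp holds 30001 rows, and this count should come out as 15001 (ids 9985000 through 10000000).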
I asked for the explain of the select part only. The except version:
explain
select *
from tmp
except
select *
from tbl
;
                                       QUERY PLAN
----------------------------------------------------------------------------------------
 HashSetOp Except  (cost=0.00..270098.68 rows=200 width=4)
   ->  Append  (cost=0.00..245018.94 rows=10031897 width=4)
         ->  Subquery Scan on "*SELECT* 1"  (cost=0.00..771.40 rows=31920 width=4)
               ->  Seq Scan on tmp  (cost=0.00..452.20 rows=31920 width=4)
         ->  Subquery Scan on "*SELECT* 2"  (cost=0.00..244247.54 rows=9999977 width=4)
               ->  Seq Scan on tbl  (cost=0.00..144247.77 rows=9999977 width=4)
(6 rows)
The outer join version:
explain
select *
from
tmp
left join
tbl using (id)
where tbl.id is null
;
                                QUERY PLAN
--------------------------------------------------------------------------
 Nested Loop Anti Join  (cost=0.00..208142.58 rows=15960 width=4)
   ->  Seq Scan on tmp  (cost=0.00..452.20 rows=31920 width=4)
   ->  Index Scan using tbl_pkey on tbl  (cost=0.00..7.80 rows=1 width=4)
         Index Cond: (tmp.id = id)
(4 rows)
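Both plans show only the planner's cost estimates. To compare actual run times, something like the following sketch could be used with the same test tables. Temporary tables are not processed by autovacuum, so running ANALYZE on tmp first gives the planner real statistics. No timings are given here because they depend on hardware and configuration.

```sql
-- Temporary tables are not processed by autovacuum, so collect statistics by hand.
ANALYZE tmp;

-- EXPLAIN ANALYZE executes the query and reports actual row counts and timings.
EXPLAIN ANALYZE
SELECT *
FROM   tmp
LEFT   JOIN tbl USING (id)
WHERE  tbl.id IS NULL;
```

The same EXPLAIN ANALYZE prefix can be put in front of the except query to compare the two on equal terms.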