Delete using CTE slower than using temp table in Postgres

I'm wondering if somebody can explain why this runs so much longer using a CTE rather than a temp table. I'm basically deleting duplicate rows from a customer table (why the duplicates exist is beyond the scope of this post).

This is Postgres 9.5.

The CTE version is this:

with targets as
    (
        select
            id,
            row_number() over(partition by uuid order by created_date desc) as rn
        from
            customer
    )
delete from
    customer
where
    id in
        (
            select
                id
            from
                targets
            where
                rn > 1
        );

I killed that version this morning after it had been running for over an hour.

The temp table version is this:

create temp table
    targets
as select
    id,
    row_number() over(partition by uuid order by created_date desc) as rn
from
    customer;

delete from
    customer
where
    id in
        (
            select
                id
            from
                targets
            where
                rn > 1
        );

This version finishes in about 7 seconds.

Any idea what may be causing this?

The CTE is slower because it has to be executed unaltered (via a CTE scan).

TFM (section 7.8.2) states: "Data-modifying statements in WITH are executed exactly once, and always to completion, independently of whether the primary query reads all (or indeed any) of their output. Notice that this is different from the rule for SELECT in WITH: as stated in the previous section, execution of a SELECT is carried only as far as the primary query demands its output."

The CTE is thus an optimisation barrier: the optimiser is not allowed to dismantle it, even if that would result in a smarter plan with the same results.
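
One way to see the barrier is to ask for the plan of the CTE version. The following is a minimal sketch that reuses only the table and column names from the question. Plain EXPLAIN plans the statement without executing it, so nothing is deleted. On 9.5 the targets query would be expected to appear as its own CTE Scan node instead of being merged into the delete, although the exact plan depends on the data and settings.

```sql
-- Plan only: EXPLAIN without ANALYZE does not run the DELETE.
EXPLAIN
WITH targets AS (
    SELECT id,
           row_number() OVER (PARTITION BY uuid ORDER BY created_date DESC) AS rn
    FROM customer
)
DELETE FROM customer
WHERE id IN (SELECT id FROM targets WHERE rn > 1);
```

This applies to older releases such as the 9.5 in the question. From PostgreSQL 12 onward, the planner can inline a side-effect-free CTE that is referenced only once.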

The CTE solution can be refactored into a joined subquery, though (similar to the temp table in the question). In Postgres, a joined subquery is usually faster than the EXISTS() variant nowadays.

DELETE FROM customer del
USING ( SELECT id
        , row_number() over(partition by uuid order by created_date desc)
                 as rn
        FROM customer
        ) sub
WHERE sub.id = del.id
AND sub.rn > 1
        ;

Another way is to use a TEMP VIEW. This is syntactically equivalent to the temp table case, but semantically equivalent to the joined subquery form (they yield exactly the same query plan, at least in this case). This is because Postgres's optimiser dismantles the view and combines it with the main query (pull-up). You could see a view as a kind of macro in PG.

CREATE TEMP VIEW targets
AS SELECT id
        , row_number() over(partition by uuid ORDER BY created_date DESC) AS rn
FROM customer;

EXPLAIN
DELETE FROM customer
WHERE id IN ( SELECT id
            FROM targets
            WHERE rn > 1
        );
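
To check the "same plan" claim, the joined-subquery delete can be explained in the same way and its output compared with that of the EXPLAIN above. This is a small sketch using only the names from this post. The DROP VIEW at the end is just cleanup (a temp view also disappears when the session ends).

```sql
-- Plan only: compare this output with the plan of the view-based DELETE.
EXPLAIN
DELETE FROM customer del
USING (
    SELECT id,
           row_number() OVER (PARTITION BY uuid ORDER BY created_date DESC) AS rn
    FROM customer
) sub
WHERE sub.id = del.id
  AND sub.rn > 1;

-- Optional cleanup of the temporary view created above.
DROP VIEW targets;
```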

[UPDATED: I was wrong about CTEs always needing to be executed to completion; that is only the case for data-modifying CTEs.]

Using a CTE is likely to cause different bottlenecks than using a temporary table. I'm not familiar with how PostgreSQL implements CTEs, but it is likely in memory, so if your server is memory-starved and the result set of your CTE is very large, you could run into issues there. I would monitor the server while running your query and try to find where the bottleneck is.
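
For locating the bottleneck inside Postgres itself, one option is EXPLAIN with the ANALYZE and BUFFERS options. This is an assumption about what "monitoring" could look like here, not something stated above. Unlike plain EXPLAIN, ANALYZE really executes the statement, so the sketch wraps it in a transaction that is rolled back, and it takes as long as the delete itself would.

```sql
BEGIN;

-- Executes the DELETE and reports actual row counts, timings and buffer usage.
-- Substitute whichever variant of the delete you want to measure.
EXPLAIN (ANALYZE, BUFFERS)
DELETE FROM customer del
USING (
    SELECT id,
           row_number() OVER (PARTITION BY uuid ORDER BY created_date DESC) AS rn
    FROM customer
) sub
WHERE sub.id = del.id
  AND sub.rn > 1;

-- Undo the delete; only the measurements are kept.
ROLLBACK;
```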

An alternative way of doing that delete, which might be faster than both of your methods:

DELETE FROM
    customer c
WHERE
    EXISTS (SELECT 1 FROM customer c2 WHERE c2.uuid = c.uuid AND c2.created_date > c.created_date);

That won't handle rows that tie exactly on created_date (all tied rows are kept), but that can be solved by adding the id to the subquery as well.
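
A minimal sketch of that tie-break in PostgreSQL syntax, assuming id is unique and created_date is never NULL (neither is stated in the post). It uses a row-wise comparison, which orders by created_date first and by id only when the dates are equal, so within each uuid only the row with the latest date (and, among ties, the highest id) survives.

```sql
-- Delete every row for which a "later" row with the same uuid exists.
-- "Later" means a greater created_date, or the same created_date and a greater id.
DELETE FROM customer c
WHERE EXISTS (
    SELECT 1
    FROM customer c2
    WHERE c2.uuid = c.uuid
      AND (c2.created_date, c2.id) > (c.created_date, c.id)
);
```

Rows whose uuid is NULL never match the equality test and are therefore left alone, which differs from the row_number() approach, where all NULL uuids fall into one partition.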