
Delete using CTE slower than using temp table in Postgres

I'm wondering if somebody can explain why this runs so much longer using a CTE rather than a temp table... I'm basically deleting duplicate information out of a customer table (why duplicate information exists is beyond the scope of this post).

This is Postgres 9.5.

The CTE version is this:

with targets as
    (
        select
            id,
            row_number() over(partition by uuid order by created_date desc) as rn
        from
            customer
    )
delete from
    customer
where
    id in
        (
            select
                id
            from
                targets
            where
                rn > 1
        );

I killed that version this morning after running for over an hour.

The temp table version is this:

create temp table
    targets
as select
    id,
    row_number() over(partition by uuid order by created_date desc) as rn
from
    customer;

delete from
    customer
where
    id in
        (
            select
                id
            from
                targets
            where
                rn > 1
        );

This version finishes in about 7 seconds.

Any idea what may be causing this?

The CTE is slower because it has to be executed unaltered (via a CTE scan).

TFM (section 7.8.2) states: "Data-modifying statements in WITH are executed exactly once, and always to completion, independently of whether the primary query reads all (or indeed any) of their output. Notice that this is different from the rule for SELECT in WITH: as stated in the previous section, execution of a SELECT is carried only as far as the primary query demands its output."

It is thus an optimisation barrier; the optimiser is not allowed to dismantle the CTE, even if doing so would result in a smarter plan with the same results.
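You can see the barrier in the plan itself. A quick sketch (run against the same customer table): EXPLAIN on the CTE version should show a CTE Scan on targets feeding the IN subplan, rather than a join against the base table.

-- same delete as in the question, wrapped in EXPLAIN to inspect the plan
EXPLAIN
WITH targets AS (
    SELECT id,
           row_number() OVER (PARTITION BY uuid ORDER BY created_date DESC) AS rn
    FROM customer
)
DELETE FROM customer
WHERE id IN (SELECT id FROM targets WHERE rn > 1);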

The CTE solution can be refactored into a joined subquery, though (similar to the temp table in the question). In Postgres, a joined subquery is nowadays usually faster than the EXISTS() variant.

DELETE FROM customer del
USING ( SELECT id
        , row_number() over(partition by uuid order by created_date desc)
                 as rn
        FROM customer
        ) sub
WHERE sub.id = del.id
AND sub.rn > 1
        ;

Another way is to use a TEMP VIEW. This is syntactically equivalent to the temp table case, but semantically equivalent to the joined subquery form (they yield exactly the same query plan, at least in this case). This is because Postgres's optimiser dismantles the view and combines it with the main query (pull-up). You could see a view as a kind of macro in PG.

CREATE TEMP VIEW targets
AS SELECT id
        , row_number() over(partition by uuid ORDER BY created_date DESC) AS rn
FROM customer;

EXPLAIN
DELETE FROM customer
WHERE id IN ( SELECT id
            FROM targets
            WHERE rn > 1
        );

[UPDATED: I was wrong about CTEs always needing to be executed to completion; that is only the case for data-modifying CTEs.]

Using a CTE is likely going to cause different bottlenecks than using a temporary table. I'm not familiar with how PostgreSQL implements CTEs, but the result is likely materialized in memory, so if your server is memory starved and the result set of your CTE is very large, then you could run into issues there. I would monitor the server while running your query and try to find where the bottleneck is.
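As a minimal monitoring sketch (assuming you can query pg_stat_activity from a second session), you can watch the state and runtime of the delete while it runs:

-- run from a separate session while the DELETE is in progress
SELECT pid, state, now() - query_start AS runtime, query
FROM pg_stat_activity
WHERE state <> 'idle';

That at least tells you whether the statement is still actively running or sitting there waiting on something else.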

An alternative way of doing that delete, which might be faster than both of your methods:

DELETE FROM customer c
WHERE EXISTS (
    SELECT 1
    FROM customer c2
    WHERE c2.uuid = c.uuid
      AND c2.created_date > c.created_date
);

That won't handle situations where you have exact matches on created_date, but that can be solved by adding the id to the subquery as well.
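For example, a sketch of that tie-breaker (assuming it is acceptable to keep the row with the highest id among rows sharing the same created_date):

DELETE FROM customer c
WHERE EXISTS (
    SELECT 1
    FROM customer c2
    WHERE c2.uuid = c.uuid
      -- row comparison: keep only the row with the greatest (created_date, id) per uuid
      AND (c2.created_date, c2.id) > (c.created_date, c.id)
);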
