简体   繁体   中英

Query planner not using filter to limit calculations in high-cost join

I am trying to fix a performance problem that I am having with a query in PostgreSQL.

General concept being modeled: Software licences being purchased and assigned. I think I have removed enough of the other things being modeled that it is now very similar to a standard hotel room booking system, except that it is normal for a current assignment (hotel booking) to have no known end date.

Query's purpose: This is a view that assembles the information necessary to display information about a licence, and the purchase it came from. When the application queries the view, it provides a tLicence.id , so that one row is returned.

Non-hotel-like concepts remaining in the query:

  • Some licencing agreements limit how quickly software can be re-assigned; this has been hard-coded into the query as 1 day.
  • It is theoretically possible for a licence to have concurrent assignments, both past and present; this should not happen, and the application discourages it, but if humans made an error in the real world, the application does allow that error to be entered into the system. This is clearly different from a normal hotel system, where if guests walked into the wrong room, the current occupants would object.

The nested SELECT with the alias purchase_quantities_assignnments is a view in the database (inlined here for convenience). Ideally, I would like any fix to my performance issue to not require that a modified version of this view be inlined into the query; ideally, the view can continue to exist mostly as-is, and be used in other ways in other queries.

Problem

If I query this view (the query) with WHERE tLicences.id = 19 , the result takes a long time to generate. In particular, it seems to be generating the entire set for periodsOfAvailability_start (which is slow), and then joining; this conclusion is based on the EXPLAIN ANALYZE GroupAggregate returning 10 rows (which is the number of purchases). I feel like query planner should be able to figure out that the tAssignments.purchase_id could be used to significantly reduce how much of periodsOfAvailability_start needs to be generated.

However, if I query this view (the query) with WHERE tLicences.id = 19 AND tLicences.purchase_id = ? [ ? being that licence's purchase ID], then the query runs as expected, generating only the set for periodsOfAvailability_start that has that purchase ID; this conclusion is based on the EXPLAIN ANALYZE GroupAggregate returning 1 row (which is the number of purchases to which the licence belongs).

Query

SELECT *
FROM test.tPurchases AS tPurchases
INNER JOIN test.tLicences
    ON tLicences.purchase_id = tPurchases.id
LEFT JOIN (
    SELECT
        purchase_id,
        SUM(
            CASE
                assignment_newer_id IS NOT null
                WHEN true THEN 1
                WHEN false THEN 0
            END
        ) AS prchs_quantity_assigned,
        SUM(
            CASE
                assignment_newer_id IS null AND
                current_timestamp
                    BETWEEN licence_availability_start AND
                        licence_availability_end
                WHEN true THEN 1
                WHEN false THEN 0
            END
        ) AS prchs_quantity_notAssignedAndCanBeAssigned,
        SUM(
            CASE
                assignment_newer_id IS null AND
                current_timestamp < licence_availability_start
                WHEN true THEN 1
                WHEN false THEN 0
            END
        ) AS prchs_quantity_notAssignedAndCannotBeAssigned
    FROM (
        SELECT
            tPurchases.id AS purchase_id,
            tPurchases.date_ AS purchase_date,
            tLicences.id AS licence_id,
            GREATEST(
                tPurchases.date_,
                older.end_,
                older.start + '1 day'::interval
            ) AS licence_availability_start,
            CASE
                WHEN newer.id IS NULL THEN 'infinity'
                ELSE newer.start - '1 day'::interval
            END AS licence_availability_end,
            COALESCE(newer.start, 'infinity') AS licence_availability_uninstallBy,
            older.id AS assignment_older_id,
            older.start AS assignment_older_start,
            older.end_ AS assignment_older_end,
            newer.id AS assignment_newer_id,
            newer.start AS assignment_newer_start,
            newer.end_ AS assignment_newer_end
        FROM test.tLicences
        INNER JOIN test.tPurchases
            ON tPurchases.id = tLicences.purchase_id
        LEFT JOIN test.tAssignments AS older
            ON (
                NOT older.deleted AND
                older.licence_id = tLicences.id
            )
        LEFT JOIN test.tAssignments AS newer
            ON (
                NOT newer.deleted AND
                newer.id <> older.id AND
                newer.licence_id = older.licence_id
            )
        WHERE
            NOT tLicences.deleted
        UNION
        SELECT
            tPurchases.id AS purchase_id,
            tPurchases.date_ AS purchase_date,
            tLicences.id AS licence_id,
            tPurchases.date_ AS licence_availability_start,
            oldest.start - '1 day'::interval AS licence_availability_end,
            oldest.start AS licence_availability_uninstallBy,
            null AS assignment_older_id,
            null AS assignment_older_start,
            null AS assignment_older_end,
            oldest.id AS assignment_newer_id,
            oldest.start AS assignment_newer_start,
            oldest.end_ AS assignment_newer_end
        FROM test.tLicences
        INNER JOIN test.tPurchases
            ON tPurchases.id = tLicences.purchase_id
        INNER JOIN test.tAssignments AS oldest
            ON oldest.licence_id = tLicences.id
        WHERE
            NOT tLicences.deleted AND
            NOT oldest.deleted
    ) AS periodsOfAvailability_start
    WHERE
        (assignment_newer_id IS null OR assignment_newer_end IS null)
    GROUP BY purchase_id
) AS purchase_quantities_assignnments
    ON
        purchase_quantities_assignnments.purchase_id = tPurchases.id
WHERE
    tLicences.id = 19 /* [Unexpected behaviour] The full set for "purchase_quantities_assignnments" is generated */
    --tLicences.id = 19 AND tLicences.purchase_id = ? /* [Expected behaviour] Only the single relevant row for "purchase_quantities_assignnments" appears to be generated */

    --tLicences.id = 19 AND tPurchases.id = ? /* [Expected behaviour] Only the single relevant row for "purchase_quantities_assignnments" appears to be generated */
    --tLicences.purchase_id = ? /* [Expected behaviour] Only the single relevant row for "purchase_quantities_assignnments" appears to be generated. Note: This is a different query *result* than the others */

Question: Is there some way that I can resolve this issue, without having to provide tLicences.purchase_id ?

Database version: PostgreSQL 9.0


SQL to generate a schema, tables, and populate those tables:

This is kind of long-running, since I wanted a quantity similar to our actual data. The number of licences (30000) and the number of assignments (100000) can be reduced if the run-time is a problem.

CREATE SCHEMA test;

CREATE TABLE test.tPurchases (
    id                  serial      not null,
    date_               date        not null,               
    /* … */
    deleted             boolean     not null    DEFAULT false,
    PRIMARY KEY (id)
);

CREATE TABLE test.tLicences (
    id                  serial      not null,
    purchase_id         integer     not null,
    /* … */
    deleted             boolean     not null    DEFAULT false,
    PRIMARY KEY (id),
    FOREIGN KEY (purchase_id)
        REFERENCES test.tPurchases (id)
        ON UPDATE RESTRICT
        ON DELETE RESTRICT
);
CREATE INDEX ON test.tLicences(purchase_id);

CREATE TABLE test.tAssignments (
    id                  serial      not null,
    licence_id          integer     not null,
    start               date        not null,
    end_                date,
    /* … */
    deleted             boolean     not null    DEFAULT false,
    PRIMARY KEY (id),
    FOREIGN KEY (licence_id)
        REFERENCES test.tLicences (id)
        ON UPDATE RESTRICT
        ON DELETE RESTRICT,
    CHECK (start <= end_)
);
CREATE INDEX ON test.tAssignments(licence_id);



INSERT INTO test.tPurchases(id, date_)
    SELECT
        id,
        '2000-01-01'::timestamp +  random() * '1 year'::interval AS date_
    FROM generate_series(1, 10) AS id
;

INSERT INTO test.tLicences(id, purchase_id, deleted)
    SELECT
        id,
        trunc(random() * 10 + 1) AS purchase_id,
        (random() > 0.99) AS deleted
    FROM generate_series(1, 30000) AS id
;

INSERT INTO test.tAssignments(id, licence_id, start, end_, deleted)
    SELECT
        assignments.id,
        assignments.licence_id,
        tPurchases.date_ + ((rank * 20 + random() * 10) || ' days')::interval AS start,
        CASE
            assignments.rank = max(assignments.rank) OVER (PARTITION BY assignments.licence_id) AND
            random() > 0.5
            WHEN true THEN null
            ELSE tPurchases.date_ + ((rank * 20 + 10 + random() * 10) || ' days')::interval
        END AS end_,
        (random() > 0.95) AS deleted
    FROM (
        SELECT
            assignments.id,
            assignments.licence_id,
            rank() OVER (PARTITION BY assignments.licence_id ORDER BY assignments.id) AS rank
        FROM (
            SELECT
                id,
                trunc(random() * 30000 + 1) AS licence_id
            FROM generate_series(1, 100000) AS id
        ) AS assignments
    ) AS assignments
    INNER JOIN test.tLicences
        ON tLicences.id = assignments.licence_id
    INNER JOIN test.tPurchases
        ON tPurchases.id = tLicences.purchase_id
;

You might need to run statistics but often you can force the optimization you want with a CTE. Here I also take out all your sub-queries to CTEs just to make it clear:

WITH myPurchases AS
( 
  SELECT *
  FROM test.tPurchases AS tPurchases
  WHERE tLicences.id = 19 
), periodsOfAvailability_start AS
(
  SELECT
      tPurchases.id AS purchase_id,
      tPurchases.date_ AS purchase_date,
      tLicences.id AS licence_id,
      GREATEST(tPurchases.date_, older.end_, older.start + '1 day'::interval) AS licence_availability_start,
      CASE WHEN newer.id IS NULL THEN 'infinity' ELSE newer.start - '1 day'::interval END AS licence_availability_end,
      COALESCE(newer.start, 'infinity') AS licence_availability_uninstallBy,
      older.id AS assignment_older_id,
      older.start AS assignment_older_start,
      older.end_ AS assignment_older_end,
      newer.id AS assignment_newer_id,
      newer.start AS assignment_newer_start,
      newer.end_ AS assignment_newer_end
  FROM test.tLicences
  INNER JOIN myPurchases AS tPurchases ON tPurchases.id = tLicences.purchase_id
  LEFT JOIN test.tAssignments AS older ON (NOT older.deleted AND older.licence_id = tLicences.id)
  LEFT JOIN test.tAssignments AS newer ON (NOT newer.deleted AND newer.id <> older.id AND newer.licence_id = older.licence_id)
  WHERE NOT tLicences.deleted

  UNION

  SELECT
      tPurchases.id AS purchase_id,
      tPurchases.date_ AS purchase_date,
      tLicences.id AS licence_id,
      tPurchases.date_ AS licence_availability_start,
      oldest.start - '1 day'::interval AS licence_availability_end,
      oldest.start AS licence_availability_uninstallBy,
      null AS assignment_older_id,
      null AS assignment_older_start,
      null AS assignment_older_end,
      oldest.id AS assignment_newer_id,
      oldest.start AS assignment_newer_start,
      oldest.end_ AS assignment_newer_end
  FROM test.tLicences
  INNER JOIN myPurchases AS tPurchases ON tPurchases.id = tLicences.purchase_id
  INNER JOIN test.tAssignments AS oldest ON oldest.licence_id = tLicences.id
  WHERE NOT tLicences.deleted AND NOT oldest.deleted
), purchase_quantities_assignnments AS
(
  SELECT
    purchase_id,
    SUM(CASE WHEN assignment_newer_id IS NOT null THEN 1 ELSE 0 END) AS prchs_quantity_assigned,
    SUM(CASE WHEN assignment_newer_id IS null AND current_timestamp BETWEEN licence_availability_start AND licence_availability_end THEN 1 ELSE false END) AS prchs_quantity_notAssignedAndCanBeAssigned,
    SUM(CASE WHEN assignment_newer_id IS null AND current_timestamp < licence_availability_start THEN 1 ELSE 0 END) AS prchs_quantity_notAssignedAndCannotBeAssigned
  FROM periodsOfAvailability_start
  WHERE assignment_newer_id IS null OR assignment_newer_end IS null
  GROUP BY purchase_id
)
SELECT *
FROM myPurchases AS tPurchases
INNER JOIN test.tLicences ON tLicences.purchase_id = tPurchases.id
LEFT JOIN purchase_quantities_assignnments ON purchase_quantities_assignnments.purchase_id = tPurchases.id

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM