
Postgres SQL Query With Nested Sub-Query Taking Too Long

I am working with a transaction records database with many millions of rows and the following columns / setup:

    Orderdate   OrderID        CustomerId  Product   Price  Total_Amount
    30/02/2018  online-56134   492512952   125582    50     50
    20/05/2020  offline-14452  291312855   125582    50     82
    20/05/2020  offline-14452  291312855   291824    32     82
    15/07/2015  offline-29528  192501431   693012    71     71
    09/01/2017  offline-53422  291367825   Donation  10     20
    09/01/2017  offline-53422  291367825   214257    10     20
    16/11/2016  online-63642   NULL        639102    53     53
    11/01/2017  online-96458   891367243   Shipping  10     10

I want to find out the average annual spend of all customers who have transacted in the past three years, and have never transacted offline. I have a query which runs fast enough for all customers:

    SELECT
       (SELECT SUM(CAST(total_amount AS NUMERIC)) FROM (SELECT DISTINCT orderid, total_amount, orderdate 
        FROM sales WHERE orderdate > (NOW() - INTERVAL '12 month') AND customerid IS NOT NULL AND product 
        NOT LIKE 'SHIP%' AND product NOT LIKE 'Ship%' AND product != 'DONATION' AND product != 'Donation' 
        AND customerid NOT LIKE '111222333%') AS "Total Sales - Returns"
       )
    /
       (SELECT COUNT(DISTINCT customerid) FROM sales WHERE orderdate BETWEEN (NOW() - INTERVAL '3 years') 
        AND NOW() AND product NOT LIKE 'SHIP%' AND product NOT LIKE 'Ship%' AND product != 'DONATION' AND 
        product != 'Donation' AND customerid NOT LIKE '111222333%'
       );

However, my solution for online-only customers includes inefficient nested subqueries, which are slowing my query down significantly:

    SELECT
       (SELECT SUM(CAST(total_amount AS NUMERIC)) FROM (SELECT DISTINCT orderid, total_amount, orderdate 
        FROM sales WHERE orderdate > (NOW() - INTERVAL '12 month') AND customerid IS NOT NULL AND product 
        NOT LIKE 'SHIP%' AND product NOT LIKE 'Ship%' AND product != 'DONATION' AND product != 'Donation' 
        AND customerid NOT LIKE '111222333%' AND customerid NOT IN (SELECT customerid FROM sales WHERE 
        orderid NOT LIKE 'online%')) AS "Total Sales - Returns"
       )
    /
       (SELECT COUNT(DISTINCT customerid) FROM sales WHERE orderdate BETWEEN (NOW() - INTERVAL '3 years') 
        AND NOW() AND product NOT LIKE 'SHIP%' AND product NOT LIKE 'Ship%' AND product != 'DONATION' AND 
        product != 'Donation' AND customerid NOT LIKE '111222333%' AND customerid NOT IN (SELECT 
        customerid FROM sales WHERE orderid NOT LIKE 'online%')
       );

Overall, I have many similar queries (for average transaction quantity, time between transactions, first purchase date, and more). I therefore need to apply the same online-only logic to many queries, and I also need a variant that excludes online-only customers. In effect, there are three sets of queries: one for all customers, one for online-only customers, and one that excludes online-only customers.

Does anyone have advice on how I can significantly speed up the above query and my other online-only customer queries?

If I follow you correctly, you can get the average annual spend over the last 3 years for customers that only had online sales with the following query:

    select customerid, sum(total_amount) / 3 as avg_year_amount
    from sales
    where orderdate > current_date - interval '3 year'
    group by customerid
    having bool_and(orderid like 'online%')

If you want the overall average of such customers, you can add another level of aggregation:

    select avg(avg_year_amount) as grand_avg
    from (
        select customerid, sum(total_amount) / 3 as avg_year_amount
        from sales
        where orderdate > current_date - interval '3 year'
        group by customerid
        having bool_and(orderid like 'online%')
    ) t

Your query has additional filters in the where clauses that are not described in the question. You can add them to the where clause of the subquery as needed.
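As a rough sketch of what adding those filters might look like, the query below folds the question's predicates into the inner query. It assumes the `sales` table has the columns shown in the question, that `orderdate` is a date or timestamp, and that `total_amount` is stored as text (suggested by the `CAST` in the question, but not confirmed). Because `total_amount` repeats on every line of an order, the sketch first reduces the data to one row per order, as the question's `SELECT DISTINCT` does. Note that the `WHERE` clause runs before `bool_and`, so this version only checks for offline orders within the three-year window and among the rows that survive the filters.

```sql
SELECT avg(avg_year_amount) AS grand_avg
FROM (
    SELECT customerid,
           sum(CAST(total_amount AS numeric)) / 3 AS avg_year_amount
    FROM (
        -- One row per order, mirroring the question's SELECT DISTINCT,
        -- so a multi-line order is not counted once per product line.
        SELECT DISTINCT customerid, orderid, total_amount
        FROM sales
        WHERE orderdate > current_date - interval '3 year'
          AND customerid IS NOT NULL
          AND customerid NOT LIKE '111222333%'
          AND product NOT LIKE 'SHIP%'
          AND product NOT LIKE 'Ship%'
          AND product NOT IN ('DONATION', 'Donation')
    ) orders
    GROUP BY customerid
    -- Keep only customers whose remaining orders are all online.
    HAVING bool_and(orderid LIKE 'online%')
) t;
```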

I want to find out the average annual spend of all customers who have transacted in the past three years, and have never transacted offline.

I don't find this explanation totally clear. Let me assume that you want:

  • Any customer who has had any transaction in the past three years.
  • The total spend over three years divided by 3.
  • The customer was always online, both during the three years and before.

Note: A customer who has exactly one transaction of 300 2.5 years ago would count as 100 per year (if included).

Then:

    -- count(distinct customerid) averages per customer rather than per row
    select sum(total_amount) / (3 * count(distinct customerid)) as yearly_average
    from (select s.*,
                 bool_and(orderid like 'online%') over (partition by customerid) as always_online
          from sales s
         ) s
    where always_online and
          orderdate > current_date - interval '3 year';
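Since the question mentions three families of queries (all customers, online-only, and everyone except online-only), one option is to compute the always-online flag once and reuse it. The sketch below is untested against the asker's data; the name `customer_channel` is hypothetical, and a materialized view has to be refreshed (`REFRESH MATERIALIZED VIEW customer_channel;`) whenever `sales` changes.

```sql
-- Hypothetical helper: one row per customer with an always-online flag.
CREATE MATERIALIZED VIEW customer_channel AS
SELECT customerid,
       bool_and(orderid LIKE 'online%') AS always_online
FROM sales
WHERE customerid IS NOT NULL
GROUP BY customerid;

-- Lets the join below look customers up by key.
CREATE UNIQUE INDEX ON customer_channel (customerid);

-- Online-only customers active in the last three years.
-- Use "NOT c.always_online" for the complementary set,
-- or drop the join entirely for all customers.
SELECT count(DISTINCT s.customerid)
FROM sales s
JOIN customer_channel c USING (customerid)
WHERE c.always_online
  AND s.orderdate > current_date - interval '3 year';
```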

My guess is that the subquery

    (SELECT customerid FROM sales WHERE orderid NOT LIKE 'online%')

is evaluated repeatedly for every row, returning the same result each time and wasting a lot of time. If the subquery is instead factored out into a common table expression (CTE), for example

    WITH offcus (id) AS (
      SELECT customerid FROM sales
      WHERE orderid NOT LIKE 'online%')
    SELECT ... AND customerid NOT IN (SELECT id FROM offcus) ...

your query may be as fast as your "fast enough" query, though I have not tested this myself. The EXPLAIN command is worth a try, as it gives a clear view of how queries are executed.
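Another rewrite worth comparing is `NOT EXISTS`, which PostgreSQL can plan as an anti-join. It also avoids a pitfall of `NOT IN`: if the subquery returns even one NULL `customerid`, `NOT IN` rejects every row. The sketch below applies this to the customer-count half of the question's query and wraps it in `EXPLAIN (ANALYZE, BUFFERS)`, which runs the query and reports the actual plan and timings. Whether it is faster on the asker's data is untested, and the index in the comment is only a suggestion with a hypothetical name.

```sql
-- Possibly helpful, untested on the real data:
-- CREATE INDEX sales_customerid_idx ON sales (customerid);

EXPLAIN (ANALYZE, BUFFERS)
SELECT count(DISTINCT s.customerid)
FROM sales s
WHERE s.orderdate BETWEEN now() - interval '3 years' AND now()
  AND s.product NOT LIKE 'SHIP%'
  AND s.product NOT LIKE 'Ship%'
  AND s.product NOT IN ('DONATION', 'Donation')
  AND s.customerid NOT LIKE '111222333%'
  -- Exclude any customer with at least one non-online order.
  AND NOT EXISTS (
        SELECT 1
        FROM sales o
        WHERE o.customerid = s.customerid
          AND o.orderid NOT LIKE 'online%'
      );
```

Running the same `EXPLAIN` on the original `NOT IN` version and comparing the two plans should show where the time goes.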
