I have two tables which look like this:
table_1:
-------------------------------
| ID | customer_id | city     |
-------------------------------
| 0  | E100        | Sydney   |
-------------------------------
| 1  | E200        | Toronto  |
-------------------------------
| 2  | E300        | New York |
-------------------------------
table_2:
----------------------------------------
| customer_id | timestamp    | receipt |
----------------------------------------
| E200        | '2019-03-25' | 200$    |
----------------------------------------
| E300        | '2019-03-26' | 300$    |
----------------------------------------
| E300        | '2019-03-26' | 100$    |
----------------------------------------
| E100        | '2019-03-27' | 50$     |
----------------------------------------
| E100        | '2019-03-28' | 50$     |
----------------------------------------
| E100        | '2019-03-29' | 50$     |
----------------------------------------
What I want to do is sum up all receipts for each distinct customer_id. The result table should look like the following:
-----------------------------------------
| customer_id | city     | sum(receipt) |
-----------------------------------------
| E100        | Sydney   | 150$         |
-----------------------------------------
| E200        | Toronto  | 200$         |
-----------------------------------------
| E300        | New York | 400$         |
-----------------------------------------
In order to achieve this, I use the following PostgreSQL query:
SELECT a.customer_id, a.city, SUM(b.receipt)
FROM public.table_1 a
INNER JOIN public.table_2 b
ON a.customer_id = b.customer_id
WHERE b.timestamp > '2019-03-25 00:00:00'
AND b.timestamp < '2019-04-01 00:00:00'
GROUP BY a.customer_id, a.city
However, table_2 has more than 300 million rows (table_1 has only 129), and the query takes too long. I don't know exactly how long, because EXPLAIN ANALYZE on this query did not finish either. I guess the INNER JOIN is the bottleneck here (please correct me if I am wrong). I do know that the query returns the right result, because I have tried it with a filter covering just one day instead of one week.
My question is how to speed up this query. I have already considered adding an index like this:
CREATE INDEX table_2_index ON table_2(customer_id, timestamp)
But this statement is also taking forever.
Any suggestions?
Try to aggregate first, then join:
SELECT a.customer_id, a.city, b.receipt_sum
FROM public.table_1 a
JOIN (
    SELECT t2.customer_id, SUM(t2.receipt) AS receipt_sum
    FROM public.table_2 t2
    WHERE t2.timestamp > '2019-03-25 00:00:00'
      AND t2.timestamp < '2019-04-01 00:00:00'
    GROUP BY t2.customer_id
) b ON a.customer_id = b.customer_id
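If the date filter inside the subquery is still slow, an index whose leading column is timestamp may help more than one on (customer_id, timestamp), because the WHERE clause restricts only timestamp. A minimal sketch, assuming PostgreSQL 11 or later (needed for INCLUDE); the index name table_2_ts_idx is made up for illustration:

```sql
-- Hypothetical index name; timestamp leads because the filter is a range on timestamp only.
-- INCLUDE (PostgreSQL 11+) stores customer_id and receipt in the index,
-- which can allow an index-only scan for the aggregation.
-- CONCURRENTLY cannot be run inside a transaction block.
CREATE INDEX CONCURRENTLY table_2_ts_idx
    ON public.table_2 (timestamp)
    INCLUDE (customer_id, receipt);

-- Refresh planner statistics after the index has been built.
ANALYZE public.table_2;
```

CONCURRENTLY avoids blocking writes while the index is built, but the build still has to read the whole table, so it will take a while on 300 million rows.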
Let's try filtering table_2 first, before joining:
SELECT a.customer_id, a.city, SUM(b.receipt)
FROM public.table_1 a
INNER JOIN (
    SELECT receipt, customer_id
    FROM public.table_2
    WHERE timestamp > '2019-03-25 00:00:00'
      AND timestamp < '2019-04-01 00:00:00'
) b ON a.customer_id = b.customer_id
GROUP BY a.customer_id, a.city
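To check whether a rewrite like this actually changes the plan, plain EXPLAIN (without ANALYZE) prints the planner's estimated plan without executing the query, so it returns quickly even on a large table. A sketch using the query above; note that the planner may well flatten the subquery and end up with the same plan as the original join:

```sql
-- EXPLAIN without ANALYZE only plans the query; it does not run it.
EXPLAIN
SELECT a.customer_id, a.city, SUM(b.receipt)
FROM public.table_1 a
INNER JOIN (
    SELECT receipt, customer_id
    FROM public.table_2
    WHERE timestamp > '2019-03-25 00:00:00'
      AND timestamp < '2019-04-01 00:00:00'
) b ON a.customer_id = b.customer_id
GROUP BY a.customer_id, a.city;
```

Comparing this output with the EXPLAIN output of the aggregate-first version shows whether the planner chooses a sequential scan or an index scan on table_2.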