简体   繁体   中英

How can I efficiently paginate the results of a complex SQL query?

I have a fairly complex SQL query which first fetches some data into a CTE and then performs several self-joins on the CTE in order to compute a value. Here's an abberivated exmaple, with some complexities of our application simplified:

WITH subset AS (
  SELECT time, value, device_id FROM raw_data
  WHERE device_id IN (1, 2, 3)
  AND time BETWEEN '2019-01-01 00:00:00'::timestamp AND '2019-01-15 00:00:00'::timestamp
)

SELECT
  time,
  (("device_1".value + "device_2".value) / "device_3".value) as value
FROM 

(
  SELECT * FROM subset 
  WHERE device_id = 1
) "device_1"

INNER JOIN
(
  SELECT * FROM subset 
  WHERE device_id = 2
) "device_2"
ON "device_1".time = "device_2".time

INNER JOIN
(
  SELECT * FROM subset 
  WHERE device_id = 3
) "device_3"
ON "device_3".time = "device_2".time

The query is auto-generated and can scale to a complex computation over the values of potentially tens of devices. For performance reasons, we would like to paginate the results of this query, since the time range used could be large. A key constraint is that the data could have gaps in time, but we want to return a constant number of rows per page.

We have considered using LIMIT per_page OFFSET start at the end of the query which would be the standard approach, but this doesn't buy us any speed up and the query performs the same. This makes sense, because in this case the LIMIT/OFFSET is performed after all the data has been fetched, joined, and computed and it just returns a slice of the data which is already computed. This doesn't appreciably reduce how fast the query runs.

We have considered paginating the data fetched into the CTE, ie computing what time range corresponds to the page of interest, and then using that time range in the BETWEEN clause of the CTE. This would work, but the problem is we cannot reliably compute this time range, as some of the variables could have gaps in them. Therefore, if we compute 100 rows to be a window of 2 days and we fetch 2 days, there's a possibility we get less than 100 rows if device_2 did not record data at some point in that window. For the computation, those data points would be dropped in the INNER JOINS.

The question is, is there an efficient way to paginate this query or restructure it to enable fast pagination, given these constraints? For example, is there some way to instruct the query planner to "join until you match 100 results matching the join conditions, and stop there". We are running this on PostgreSQL, if that makes a difference.

Assuming your use case can tolerate not having the absolute latest data all the time, you could consider creating a materialized view:

WITH subset AS ( ... )

CREATE MATERIALIZED VIEW yourView AS SELECT ...

Paginating a materialized view using LIMIT and OFFSET should be faster than running the full query from scratch each time. The downside here would be that you would be returning data from a view, which would have to be updated at some frequency which meets your requirements.

As an alternative/in addition to a materialized view, you could consider tuning the query using indices. For example, an index which would speed up the subset CTE query might be:

CREATE INDEX idx1 ON raw_data (time, device_id, value);

Or maybe:

CREATE INDEX idx2 ON raw_data (device_id, time, value);

1) Create a composite index with the following order device_id and time desc.

2) Try to generate a query in this way

select device_1.time,
 (("device_1".value + "device_2".value) / "device_3".value) as value 
from raw_data as device_1 ,raw_data as device_2 ,raw_data as device_3 
where device_1.devise_id = 1 
and device_2.devise_id = 2 
and device_3.devise_id = 3 
and device_1.time BETWEEN '2019-01-01 00:00:00'::timestamp AND '2019-01-15 00:00:00'::timestamp 
and device_2.time BETWEEN '2019-01-01 00:00:00'::timestamp AND '2019-01-15 00:00:00'::timestamp 
and device_3.time BETWEEN '2019-01-01 00:00:00'::timestamp AND '2019-01-15 00:00:00'::timestamp 
and device_1.time = device_2.time 
and device_2.time = device_3.time

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM