Query performance with sorting in postgres

Question

I have a performance issue with a query on a table that has 33m rows. The query should return 6m rows. I want the response to the request to begin without any significant delay, because my app streams the data. Once it has started, the data transfer may take longer. The difficulty is that the query sorts its result. So I created an index on the fields that are used in the "order by" and in the "where" clause.

For example:

CREATE TABLE Table1 (
   Id SERIAL PRIMARY KEY,
   Field1 INT NOT NULL,
   Field2 INT NOT NULL,
   Field3 INT NOT NULL,
   Field4 VARCHAR(200) NOT NULL,
   CreateDate TIMESTAMP,
   CloseDate TIMESTAMP NULL
);
CREATE INDEX IX_Table1_SomeIndex ON Table1 (Field2, Field4);

And the query looks like this:

SELECT * FROM Table1 t
WHERE t.CreateDate >= '2020-01-01' AND t.CreateDate < '2021-01-01'
ORDER BY t.Field2, t.Field4

It leads to the following: when I add "LIMIT 1000" it returns the result immediately and builds the following plan: the plan with 'LIMIT'

When I run it without "LIMIT" it "thinks" for about a minute and then returns data for about 16 minutes. And it builds the following plan: the plan with 'LIMIT'

Why are the plans different?

Could you help me find a solution that starts streaming immediately (without LIMIT)?

Thanks!

Answer 1

If "when I add LIMIT 1000 it returns result immediately" and you want to avoid latency, then I would suggest that you run a slightly modified query many times in a loop with LIMIT 1000. An important benefit is that there will be no long-running transactions.

The query that runs many times in the loop should return records starting after the largest value of (field2, field4) returned by the previous iteration.

SELECT * 
  FROM table1 t
 WHERE t.CreateDate >= '2020-01-01' AND t.CreateDate < '2021-01-01'
   AND (t.field2, t.field4) > (:last_run_largest_f2_value, :last_run_largest_f4_value) 
 ORDER BY t.field2, t.field4
 LIMIT 1000;

last_run_largest_f2_value and last_run_largest_f4_value are parameters. Their values come from the last record returned by the previous iteration.
The condition AND (t.field2, t.field4) > (:last_run_largest_f2_value, :last_run_largest_f4_value) must be omitted in the first iteration.

Important limitation

This is an alternative to OFFSET that will work correctly if (field2, field4) values are unique. If they are not, rows that share the (field2, field4) value of the last record on a page are skipped by the strict ">" comparison in the next iteration.
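The loop described in this answer could look roughly like the sketch below. It is only an illustration under assumptions: it uses Python's built-in sqlite3 module with an in-memory database as a stand-in for PostgreSQL so that it runs without a server (SQLite 3.15 or later accepts the same row-value comparison), positional "?" placeholders replace the named parameters, the sample data is made up, and the names make_demo_db and stream_pages are hypothetical.

```python
import sqlite3

# First iteration: no (field2, field4) condition yet.
FIRST_PAGE = """
SELECT id, field2, field4 FROM table1
 WHERE createdate >= ? AND createdate < ?
 ORDER BY field2, field4
 LIMIT ?
"""

# Later iterations: continue strictly after the last key already returned.
NEXT_PAGE = """
SELECT id, field2, field4 FROM table1
 WHERE createdate >= ? AND createdate < ?
   AND (field2, field4) > (?, ?)
 ORDER BY field2, field4
 LIMIT ?
"""


def make_demo_db():
    """Build a small in-memory table with unique (field2, field4) pairs."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE table1 (id INTEGER PRIMARY KEY, field2 INT NOT NULL, "
        "field4 TEXT NOT NULL, createdate TEXT)"
    )
    conn.execute("CREATE INDEX ix_table1_someindex ON table1 (field2, field4)")
    rows = [
        (f2, f"v{f4}", "2019-06-01" if (f2 + f4) % 4 == 0 else "2020-06-01")
        for f2 in range(4)
        for f4 in range(3)
    ]
    conn.executemany(
        "INSERT INTO table1 (field2, field4, createdate) VALUES (?, ?, ?)", rows
    )
    return conn


def stream_pages(conn, start, end, page_size=1000):
    """Yield matching rows in (field2, field4) order, one short query per page."""
    last_key = None
    while True:
        if last_key is None:
            page = conn.execute(FIRST_PAGE, (start, end, page_size)).fetchall()
        else:
            page = conn.execute(
                NEXT_PAGE, (start, end, last_key[0], last_key[1], page_size)
            ).fetchall()
        if not page:
            return
        yield from page
        # Remember the largest (field2, field4) seen so far for the next query.
        last_key = (page[-1][1], page[-1][2])


if __name__ == "__main__":
    demo_conn = make_demo_db()
    for row in stream_pages(demo_conn, "2020-01-01", "2021-01-01", page_size=3):
        print(row)
```

Because each page is a separate short query, the first rows reach the caller as soon as the first page is fetched. The sketch relies on the uniqueness of (field2, field4) stated above.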

Answer 2

You will need to use a server-side cursor or something similar for this to work. Otherwise it runs the query to completion before returning any results. There is no "streaming" by default. How you do this depends on your client, which you don't mention.

If you simply DECLARE a cursor and then FETCH in chunks, then the setting cursor_tuple_fraction will control whether it chooses the plan with a faster start-up cost (like what you get with the LIMIT), or a faster overall run cost (like you get without the LIMIT).
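As an illustration of the DECLARE and FETCH approach, here is a minimal sketch in Python. It makes several assumptions: conn is a Python DB-API connection to PostgreSQL (for example one returned by psycopg2.connect) that is not in autocommit mode, so the statements run inside one transaction; the function name stream_rows, the cursor name stream_cur, and the value 0.01 for cursor_tuple_fraction are placeholders chosen for the example, not recommendations from the answer.

```python
def stream_rows(conn, query, params=None, chunk_size=1000, tuple_fraction=0.01):
    """Yield the rows of `query` chunk by chunk through a server-side cursor.

    `conn` is assumed to be a DB-API connection to PostgreSQL with a
    transaction open (or opened implicitly on the first statement).
    """
    if not 0.0 <= tuple_fraction <= 1.0:
        raise ValueError("tuple_fraction must be between 0.0 and 1.0")
    cur = conn.cursor()
    try:
        # A small fraction tells the planner that only the first part of the
        # result matters soon, which favors plans with a low start-up cost.
        # SET LOCAL lasts only until the end of the current transaction.
        cur.execute(f"SET LOCAL cursor_tuple_fraction = {float(tuple_fraction)}")
        cur.execute("DECLARE stream_cur NO SCROLL CURSOR FOR " + query, params)
        while True:
            cur.execute(f"FETCH FORWARD {int(chunk_size)} FROM stream_cur")
            rows = cur.fetchall()
            if not rows:
                break
            yield from rows
        cur.execute("CLOSE stream_cur")
    finally:
        cur.close()
```

A caller would pass the original query without LIMIT, for example stream_rows(conn, "SELECT * FROM Table1 t WHERE ... ORDER BY t.Field2, t.Field4"), and process rows as they arrive. Whether the planner actually picks the index-ordered plan still depends on its cost estimates, so it is worth checking with EXPLAIN.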
