Improving efficiency of a self join in PostgreSQL
I am performing the following query with a self join:
with t as (
SELECT *, TIMESTAMP 'epoch' + tstamp * INTERVAL '1 second' as tstamp2
FROM
mytable
WHERE id = 'a'
LIMIT 1000
)
select v1.id as id, date_trunc('hour', v1.tstamp2) as hour, v1.value as start, v2.value as stop
from
t v1 join
t v2
on v1.id = v2.id and
date_trunc('hour', v1.tstamp2) = date_trunc('hour', v2.tstamp2) and
v1.tstamp2 < v2.tstamp2
where 1=1
limit 100;
The table looks like this:
id tstamp value tstamp2
My goal is to output all combinations of "value" within the same hour for one id. I have 100,000 unique ids and millions of rows. This is extremely slow and inefficient. Is there a way to break up the query so that the self join operates on time partitions (hour by hour, for example) to improve its speed?
EDIT: I found this, which seems to be what I want to do, but I have no idea how to implement it:
If you know more than you've let on about the properties of the intervals, you might be able to improve things. For instance if the intervals fall into nonoverlapping buckets then you could add a constraint that the buckets of the two sides are equal. Postgres is a lot better with equality join constraints than it is with range constraints, so it would be able to match up rows and only do the O(N^2) work within each bucket.
This answers the question as originally tagged -- "Postgres", not "Redshift".
Unfortunately, Postgres materializes CTEs, which then precludes the use of indexes. You have no ORDER BY in the CTE, so arbitrary rows are being chosen.
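If a reproducible sample matters, an ORDER BY can be added inside the CTE so that the LIMIT picks a well-defined set of rows. This is only a sketch: ordering by tstamp is an assumption, not something the question specifies, and it does not address the materialization issue.

```sql
-- Sketch: the question's CTE with an ORDER BY added, so LIMIT 1000
-- returns the 1000 earliest rows for the id instead of arbitrary ones.
-- Ordering by tstamp is an assumption made for illustration.
WITH t AS (
    SELECT *, TIMESTAMP 'epoch' + tstamp * INTERVAL '1 second' AS tstamp2
    FROM mytable
    WHERE id = 'a'
    ORDER BY tstamp
    LIMIT 1000
)
SELECT min(tstamp2) AS first_row, max(tstamp2) AS last_row, count(*) AS n
FROM t;
```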
One solution is a temporary table and indexes:
CREATE TEMPORARY TABLE t as
SELECT t.*,
TIMESTAMP 'epoch' + tstamp * INTERVAL '1 second' as tstamp2,
DATE_TRUNC('hour', TIMESTAMP 'epoch' + tstamp * INTERVAL '1 second') as tstamp2_hour
FROM mytable t
WHERE t.id = 'a'
LIMIT 1000;
CREATE INDEX t_id_hour_tstamp2 ON t(id, tstamp2_hour, tstamp2);
select v1.id as id, v1.tstamp2_hour as hour, v1.value as start, v2.value as stop
from t v1 join
t v2
on v1.id = v2.id and
v1.tstamp2_hour = v2.tstamp2_hour and
v1.tstamp2 < v2.tstamp2
limit 100;
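To check whether the index is actually used, it may help to gather statistics on the temporary table first (autovacuum does not process temporary tables, so they are not analyzed automatically) and then look at the plan. A minimal sketch, reusing the table and column names from the statements above:

```sql
-- Collect planner statistics for the freshly created temporary table.
ANALYZE t;

-- Run the join and show the plan that was executed, with timing and buffer usage.
EXPLAIN (ANALYZE, BUFFERS)
SELECT v1.id AS id, v1.tstamp2_hour AS hour, v1.value AS start, v2.value AS stop
FROM t v1
JOIN t v2
  ON v1.id = v2.id
 AND v1.tstamp2_hour = v2.tstamp2_hour   -- equality on the hour bucket
 AND v1.tstamp2 < v2.tstamp2             -- range comparison only within a bucket
LIMIT 100;
```

With only 1000 rows the planner may still prefer a sequential scan with a hash or merge join on the equality columns; the effect of the index should become more visible as the table grows.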