Improving efficiency of a self join in PostgreSQL

I am performing the following query with a self join:

with t as (
      SELECT *, TIMESTAMP 'epoch' + tstamp * INTERVAL '1 second' as tstamp2
      FROM
      mytable 
      WHERE id = 'a'
      LIMIT 1000
    )
select v1.id as id, date_trunc('hour', v1.tstamp2) as hour, v1.value as start, v2.value as stop 
from 
    t v1 join 
    t v2 
        on v1.id = v2.id and
        date_trunc('hour', v1.tstamp2) = date_trunc('hour', v2.tstamp2) and
        v1.tstamp2 < v2.tstamp2 
where 1=1
limit 100;

The table looks like this:

id   tstamp    value    tstamp2

My goal is to output all combinations of "value" within the same hour for one id. I have 100,000 unique ids and millions of rows. This is extremely slow and inefficient. Is there a way to break up the query so that the self join operates on time partitions (hour by hour, for example) to improve its speed?

EDIT: I found this, which seems to be what I want to do, but I have no idea how to implement it:

If you know more than you've let on about the properties of the intervals, you might be able to improve things. For instance if the intervals fall into nonoverlapping buckets then you could add a constraint that the buckets of the two sides are equal. Postgres is a lot better with equality join constraints than it is with range constraints, so it would be able to match up rows and only do the O(N^2) work within each bucket.
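The bucket idea in the quoted passage could be sketched roughly as follows. This is only an illustration under assumptions: it assumes mytable.tstamp holds epoch seconds, so that floor(tstamp / 3600) identifies the hour bucket, and the index name mytable_id_hour_tstamp is made up for the example. The equality on the bucket expression gives the planner something it can hash or look up in an index, and the range condition is then only checked among rows in the same bucket.

```sql
-- Hypothetical expression index: (id, hour bucket, tstamp).
-- Assumes tstamp is epoch seconds; the index name is illustrative only.
CREATE INDEX mytable_id_hour_tstamp
    ON mytable (id, (floor(tstamp / 3600)), tstamp);

SELECT v1.id AS id,
       date_trunc('hour', TIMESTAMP 'epoch' + v1.tstamp * INTERVAL '1 second') AS hour,
       v1.value AS start,
       v2.value AS stop
FROM mytable v1
JOIN mytable v2
  ON  v1.id = v2.id
  -- Equality on the bucket: rows are matched up per (id, hour).
  AND floor(v1.tstamp / 3600) = floor(v2.tstamp / 3600)
  -- Range condition: evaluated only within a bucket.
  AND v1.tstamp < v2.tstamp
WHERE v1.id = 'a';
```

Whether the planner actually uses the index depends on the data distribution and table statistics, so it is worth checking the plan with EXPLAIN.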

This answers the question as originally tagged -- "Postgres", not "Redshift".

Unfortunately, Postgres materializes CTEs (always before version 12, and still by default when a CTE is referenced more than once, as t is here), which then precludes the use of indexes on the CTE's rows. You also have no ORDER BY in the CTE, so the LIMIT picks arbitrary rows.

One solution is a temporary table and indexes:

CREATE TEMPORARY TABLE t as
      SELECT t.*,
             TIMESTAMP 'epoch' + tstamp * INTERVAL '1 second' as tstamp2,
             DATE_TRUNC('hour', TIMESTAMP 'epoch' + tstamp * INTERVAL '1 second') as tstamp2_hour
      FROM mytable t
      WHERE t.id = 'a'
      LIMIT 1000;

CREATE INDEX t_id_hour_tstamp2 ON t(id, tstamp2_hour, tstamp2);

select v1.id as id, v1.tstamp2_hour as hour, v1.value as start, v2.value as stop 
from t v1 join 
     t v2 
        on v1.id = v2.id and
           v1.tstamp2_hour = v2.tstamp2_hour and
           v1.tstamp2 < v2.tstamp2 
limit 100;
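To check whether the index is actually used, something like the following could be run after creating the temporary table. It is a sketch rather than part of the original answer. Autovacuum does not process temporary tables, so statistics have to be gathered with a manual ANALYZE. With only 1000 rows the planner may still reasonably pick a sequential scan or a hash join.

```sql
-- Gather planner statistics for the temporary table t
-- (autovacuum cannot do this for temporary tables).
ANALYZE t;

-- Run the join and show the chosen plan with timing and buffer usage.
EXPLAIN (ANALYZE, BUFFERS)
SELECT v1.id AS id, v1.tstamp2_hour AS hour, v1.value AS start, v2.value AS stop
FROM t v1
JOIN t v2
  ON  v1.id = v2.id
  AND v1.tstamp2_hour = v2.tstamp2_hour
  AND v1.tstamp2 < v2.tstamp2
LIMIT 100;
```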
