
SQL to group time intervals by arbitrary time period

I need help with this SQL query. I have a big table with the following schema:

  • time_start (timestamp) - start time of the measurement,
  • duration (double) - duration of the measurement in seconds,
  • count_event1 (int) - number of measured events of type 1,
  • count_event2 (int) - number of measured events of type 2.

I am guaranteed that no rows will overlap: in SQL terms, there are no two rows such that time_start1 < time_start2 AND time_start1 + duration1 > time_start2.

I would like to design an efficient SQL query that groups the measurements by some arbitrary time period (I call it the group_period), for instance 3 hours. I have already tried something like this:

SELECT
    ROUND(time_start/group_period,0) AS time_period,
    SUM(count_event1) AS sum_event1,
    SUM(count_event2) AS sum_event2 
FROM measurements
GROUP BY time_period;

However, there seems to be a problem. If a measurement has a duration greater than the group_period, I would expect it to be grouped into every time period it belongs to, but since the duration is never taken into account, it gets grouped only into the first one. Is there a way to fix this?

Performance is a concern for me because I expect the table to grow considerably over time, reaching millions, possibly tens or hundreds of millions of rows. Do you have any suggestions for indexes or other optimizations to improve the speed of this query?

Based on Timekiller's advice, I have come up with the following query:

-- Since there's a problem with declaring variables in PostgreSQL,
-- we will be using aliases for the arguments required by the script.

-- First some configuration:
--   group_period = 3600   -- group by 1 hour (= 3600 seconds)
--   min_time = 1440226301 -- Sat, 22 Aug 2015 06:51:41 GMT
--   max_time = 1450926301 -- Thu, 24 Dec 2015 03:05:01 GMT

-- Calculate the number of started periods in the given interval in advance.
--   period_count = CEIL((max_time - min_time) / group_period)

SET TIME ZONE UTC;
BEGIN TRANSACTION;

-- Create a temporary table and fill it with all time periods.
CREATE TEMP TABLE periods (period_start TIMESTAMP)
    ON COMMIT DROP;
INSERT INTO periods (period_start)
    SELECT to_timestamp(min_time + group_period * coefficient)
    FROM generate_series(0, period_count) as coefficient;

-- Group data by the time periods.
-- Note that we don't require exact overlap of intervals:
--   A. [period_start, period_start + group_period]
--   B. [time_start, time_start + duration]
-- This would yield the best possible result but it would also slow
-- down the query significantly because of the part B.
-- We require only: period_start <= time_start < period_start + group_period
-- (half-open, so a measurement starting exactly on a boundary is counted once).
SELECT
    period_start,
    COUNT(measurements.*) AS count_measurements,
    SUM(count_event1) AS sum_event1,
    SUM(count_event2) AS sum_event2
FROM periods
LEFT JOIN measurements
ON time_start >= period_start
    AND time_start < period_start + group_period * INTERVAL '1 second'
GROUP BY period_start;

COMMIT TRANSACTION;

It does exactly what I was going for, so mission accomplished. However, I would still appreciate feedback on the performance of this query under the following conditions:

  • I expect the measurements table to have about 500-800 million rows.
  • The time_start column is the primary key and has a unique btree index on it.
  • I have no guarantees about min_time and max_time. I only know that the group period will be chosen so that 500 <= period_count <= 2000.

(This turned out way too large for a comment, so I'll post it as an answer instead.)

Adding to my comment on your answer, you should probably go for the most accurate results first and optimize later if the query turns out to be slow.
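As a rough sketch of what "best results" could look like, the join below matches a measurement with every period it overlaps rather than only the one it starts in. It assumes the periods temp table from the question, a group_period of 3600 seconds (illustrative), a time_start of type TIMESTAMP without time zone, and strictly positive durations. Note that the full counts of a long measurement are added to each period it touches; they are not split proportionally.

```sql
-- Sketch: count a measurement in every period it overlaps.
-- Assumptions: periods(period_start TIMESTAMP) is filled as in the question,
-- group_period is 3600 seconds, and duration > 0.
SELECT
    p.period_start,
    COUNT(m.time_start) AS count_measurements,
    SUM(m.count_event1) AS sum_event1,
    SUM(m.count_event2) AS sum_event2
FROM periods AS p
LEFT JOIN measurements AS m
    -- Two half-open intervals [start, end) overlap when
    -- each one starts before the other one ends.
    ON  m.time_start < p.period_start + 3600 * INTERVAL '1 second'
    AND m.time_start + m.duration * INTERVAL '1 second' > p.period_start
GROUP BY p.period_start
ORDER BY p.period_start;
```

Whether the extra end-time arithmetic is affordable on the full table is exactly the kind of thing to measure rather than guess.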

As for performance, one thing I've learned while working with databases is that you can't really predict it. Query optimizers in advanced DBMSs are complex and tend to behave differently on small and large data sets. You'll have to fill your table with some large sample data, experiment with indexes and read the output of EXPLAIN. There's no other way.
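For instance, a plan for the simple (start-time only) grouping could be inspected roughly like this. The date range and the one-hour period are illustrative values, and the periods are generated inline with generate_series so the statement does not depend on the temp table. Keep in mind that EXPLAIN with the ANALYZE option actually executes the query, so it can take a while on a large table.

```sql
-- Sketch: show the chosen plan, real timings and buffer usage.
-- The timestamps and the 1 hour period are illustrative assumptions.
EXPLAIN (ANALYZE, BUFFERS)
SELECT
    p.period_start,
    SUM(m.count_event1) AS sum_event1,
    SUM(m.count_event2) AS sum_event2
FROM generate_series(
         TIMESTAMP '2015-08-22 06:00:00',
         TIMESTAMP '2015-12-24 03:00:00',
         INTERVAL '1 hour'
     ) AS p(period_start)
LEFT JOIN measurements AS m
    ON  m.time_start >= p.period_start
    AND m.time_start <  p.period_start + INTERVAL '1 hour'
GROUP BY p.period_start;
```

In the output, look at which join strategy was picked and whether the index on time_start is used or the table is scanned sequentially.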

There are a few things to suggest, though I know the Oracle optimizer much better than the Postgres one, so some of them might not work.

  • Things will be faster if all fields you're checking against are included in the index. Since you're performing a left join and periods is the base table, there's probably no reason to index it, since it will be read in full either way. duration should be included in the index, though, if you go with proper interval overlap: that way Postgres won't have to fetch the row to evaluate the join condition, because the index will suffice. Chances are it won't fetch the table rows at all, since it needs no data beyond what the indexes hold. I think it'll perform better if duration is included as the second field of the time_start index (at least in Oracle it would), but IIRC Postgres is able to combine indexes, so perhaps a second index would perform better. You'll have to check it with EXPLAIN.

  • Indexes and math don't mix well. Even if duration is included in the index, there's no guarantee it will be used for (time_start + duration), though, again, look at EXPLAIN first. If it's not used, try either creating a function-based index (that is, one that includes time_start + duration as a field) or altering the table structure a bit so that time_start + duration is a separate column, and index that column instead.

  • If you don't really need a left join (that is, you're fine with empty periods missing from the result), use an inner join instead. The optimizer will likely start with the larger table (measurements) and join periods against it, possibly using a hash join instead of nested loops. If you do that, then you should also index your periods table in the same fashion, and perhaps restructure it the same way so that it contains the period start and end explicitly, as the optimizer has even more options when it doesn't have to perform any operations on the columns.

  • Perhaps most important: if you have max_time and min_time, use them to limit the rows of measurements before joining! The smaller your sets, the faster the query will run.
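The suggestions above might translate into Postgres roughly as follows. This is only a sketch: the index names are made up, time_start is assumed to be TIMESTAMP without time zone (Postgres only accepts immutable expressions in an index, and adding an interval to a timestamp with time zone is not immutable), and the time bounds are the example min_time and max_time from the question written as literals. Which of the two indexes helps, if either, still has to be checked with EXPLAIN.

```sql
-- (a) Composite index: duration as the second field of the time_start index.
CREATE INDEX measurements_start_duration_idx
    ON measurements (time_start, duration);

-- (b) Expression ("function-based") index on the computed end time.
--     Note the extra parentheses required around the expression.
CREATE INDEX measurements_end_idx
    ON measurements ((time_start + duration * INTERVAL '1 second'));

-- (c) Restrict measurements to [min_time, max_time) before joining.
--     group_period is assumed to be 3600 seconds here.
SELECT
    p.period_start,
    COUNT(m.time_start) AS count_measurements,
    SUM(m.count_event1) AS sum_event1,
    SUM(m.count_event2) AS sum_event2
FROM periods AS p
LEFT JOIN (
    SELECT time_start, count_event1, count_event2
    FROM measurements
    WHERE time_start >= TIMESTAMP '2015-08-22 06:51:41'
      AND time_start <  TIMESTAMP '2015-12-24 03:05:01'
) AS m
    ON  m.time_start >= p.period_start
    AND m.time_start <  p.period_start + 3600 * INTERVAL '1 second'
GROUP BY p.period_start;
```

The planner may already push such a range condition down on its own, so the subquery in (c) is mainly a way of making the restriction explicit.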
