
How do I use AVG() with GROUP BY in time_bucket_gapfill() in TimescaleDB, PostgreSQL?

I'm using TimescaleDB in my PostgreSQL database and I have the following two tables:

windows_log

| windows_log_id |      timestamp      | computer_id | log_count |
------------------------------------------------------------------
|        1       | 2021-01-01 00:01:02 |     382     |     30    |
|        2       | 2021-01-02 14:59:55 |     382     |     20    |
|        3       | 2021-01-02 19:08:24 |     382     |     20    |
|        4       | 2021-01-03 13:05:36 |     382     |     10    |
|        5       | 2021-01-03 22:21:14 |     382     |     40    |

windows_reliability_score

| computer_id (FK) |      timestamp      | reliability_score |
--------------------------------------------------------------
|        382       | 2021-01-01 22:21:14 |          6        |
|        382       | 2021-01-01 22:21:14 |          6        |
|        382       | 2021-01-01 22:21:14 |          6        |
|        382       | 2021-01-02 22:21:14 |          1        |
|        382       | 2021-01-02 22:21:14 |          3        |
|        382       | 2021-01-03 22:21:14 |          7        |
|        382       | 2021-01-03 22:21:14 |          8        |
|        382       | 2021-01-03 22:21:14 |          9        |

Note: both tables are hypertables, indexed on the timestamp column.

I'm trying to get the average reliability_score for each time bucket, but the query gives me the average over everything instead of the average per bucket...

This is my query:

SELECT time_bucket_gapfill(INTERVAL '1 day', wl.timestamp) AS timestamp,
       COALESCE(SUM(log_count), 0) AS log_count,
       AVG(reliability_score) AS reliability_score
FROM windows_log wl
JOIN windows_reliability_score USING (computer_id)
WHERE wl.timestamp >= '2021-01-01 00:00:00.0' AND wl.timestamp < '2021-01-04 00:00:00.0'
GROUP BY timestamp
ORDER BY timestamp ASC

This is the result I'm looking for:

|      timestamp      | log_count | reliability_score |
-------------------------------------------------------
| 2021-01-01 00:00:00 |     30    |          6        |
| 2021-01-02 00:00:00 |     20    |          2        |
| 2021-01-03 00:00:00 |     20    |          8        |

But this is what I get:

|      timestamp      | log_count | reliability_score |
-------------------------------------------------------
| 2021-01-01 00:00:00 |     30    |        5.75       |
| 2021-01-02 00:00:00 |     20    |        5.75       |
| 2021-01-03 00:00:00 |     20    |        5.75       |

The main issue is that the join condition is on column computer_id, where both tables have the same value 382. Thus every row of windows_log is joined with every row of windows_reliability_score (a Cartesian product of all rows). In addition, the grouping is done on the column timestamp, which is ambiguous and is likely to be resolved to the timestamp column of windows_log. As a result, the average uses all values of reliability_score for each value of the windows_log timestamp, which explains the undesired result.
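A toy Python reproduction of this (using the sample rows above, not TimescaleDB itself) shows why every bucket gets the same average: the join on computer_id pairs all 5 log rows with all 8 score rows, so each day's group averages all eight scores, which is 5.75.

```python
from itertools import product
from statistics import mean

# Sample rows for computer 382, reduced to (day bucket, log_count) and scores.
logs = [("2021-01-01", 30), ("2021-01-02", 20), ("2021-01-02", 20),
        ("2021-01-03", 10), ("2021-01-03", 40)]
scores = [6, 6, 6, 1, 3, 7, 8, 9]

# JOIN ... USING (computer_id): every log row pairs with every score row.
joined = [(day, score) for (day, _), score in product(logs, scores)]

# GROUP BY timestamp resolves to windows_log.timestamp, so group by the log's day.
buckets = {}
for day, score in joined:
    buckets.setdefault(day, []).append(score)

for day in sorted(buckets):
    print(day, mean(buckets[day]))  # every bucket averages all 8 scores -> 5.75
```

With 5 log rows and 8 score rows the join produces 40 rows, and each bucket's AVG is taken over the full score list rather than that day's scores.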

This resolution of the grouping ambiguity in favor of the input column, i.e. the table column, is explained in the GROUP BY description of the SELECT documentation:

In case of ambiguity, a GROUP BY name will be interpreted as an input-column name rather than an output column name.

To avoid groups that include all rows matching on the computer id, windows_log_id can be used for grouping. This also allows log_count to appear in the query result. And if you want to keep the output name timestamp, GROUP BY should reference it by position. For example:

SELECT time_bucket_gapfill('1 day'::INTERVAL, rs.timestamp) AS timestamp,
       AVG(reliability_score) AS reliability_score,
       log_count
FROM windows_log wl
JOIN windows_reliability_score rs USING (computer_id)
WHERE rs.timestamp >= '2021-01-01 00:00:00.0' AND rs.timestamp < '2021-01-04 00:00:00.0'
GROUP BY 1, windows_log_id, log_count
ORDER BY timestamp ASC

For ORDER BY this is not an issue, since the output name is used. From the same documentation:

If an ORDER BY expression is a simple name that matches both an output column name and an input column name, ORDER BY will interpret it as the output column name.

Given what we can glean from your example, there's no simple way to join these two tables with the given functions and achieve the results you want. The schema, as presented, just makes that difficult.

If this is really what your data/schema look like, then one solution is to use multiple CTEs to aggregate the two values in each table separately and then join based on bucket and computer.

WITH wrs AS (
    SELECT time_bucket_gapfill('1 day', timestamp) AS bucket, 
    computer_id,
    AVG(reliability_score) AS reliability_score  
    FROM windows_reliability_score
    WHERE timestamp >= '2021-01-01 00:00:00.0' AND timestamp < '2021-01-04 00:00:00.0'
    GROUP BY 1, 2
),
wl AS (
    SELECT time_bucket_gapfill('1 day', wl.timestamp) bucket, wl.computer_id,
    sum(log_count) total_logs
    FROM windows_log wl
    WHERE timestamp >= '2021-01-01 00:00:00.0' AND timestamp < '2021-01-04 00:00:00.0'
    GROUP BY 1, 2
)
SELECT wrs.bucket, wrs.computer_id, reliability_score, total_logs
FROM wrs LEFT JOIN wl ON wrs.bucket = wl.bucket AND wrs.computer_id = wl.computer_id;
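The aggregate-then-join logic of the two CTEs can be sketched in plain Python over the sample data (a toy check, not the actual database): averaging each side per day bucket before joining yields the per-bucket values the question expects, 6, 2 and 8.

```python
from statistics import mean

# Sample rows keyed by day bucket (taken from the tables above).
scores = {"2021-01-01": [6, 6, 6], "2021-01-02": [1, 3], "2021-01-03": [7, 8, 9]}
logs = {"2021-01-01": [30], "2021-01-02": [20, 20], "2021-01-03": [10, 40]}

# Aggregate each "table" independently per bucket, mirroring the two CTEs...
wrs = {day: mean(vals) for day, vals in scores.items()}   # AVG(reliability_score)
wl = {day: sum(vals) for day, vals in logs.items()}       # SUM(log_count)

# ...then join the pre-aggregated rows on the bucket (the LEFT JOIN above).
result = {day: (wrs[day], wl.get(day, 0)) for day in sorted(wrs)}
print(result)  # reliability averages are now per-bucket: 6, 2, 8
```

Because each side is reduced to one row per (bucket, computer) before the join, no Cartesian blow-up can occur and each AVG only sees its own bucket's scores.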

The filtering has to be applied inside each CTE, because predicate pushdown from the outer query likely wouldn't happen, and you would otherwise scan the entire hypertable before the date filter is applied (presumably not what you want).

I tried to quickly re-create your sample schema, so I apologize if I got a name wrong somewhere.


 