简体   繁体   中英

Time-based grouped window functions in timescaledb/postgres

I have a hypertable in timescaledb called prices that looks like the following:

ts instrumentId value
08:00:01 A 100
08:00:01 B 200
08:00:02 B 205
08:00:04 A 95
08:00:06 C 300
08:00:07 A 90

Where each row is a measurement ( value ) at a given timestamp ( ts ) for a given instrument ( instrumentId ). Each instrument is coming in at different times, and I have no guarantee that I'll have a datapoint for a given instrument at a given timestamp.

My hypertable is using the timestamp column as my time index, and for additional context, I have ~200M rows spanning several years of sub-second data for dozens of instruments.

I want to be able to generate 5 second unweighted historical means for each instrument for each measurement that I have. For each row in my table, I want to be able to find all of the rows which have the same instrument, and have a timestamp between (row_timestamp, row_timestamp - 5 seconds) (This set would include the row that we're generating the data for). I'd then like to take the average of that, but it's reasonable that I'd want to take other statistics from this as well (stdev, sum, etc).

The output of this query for the above example table would look like the below:

ts instrumentId avg_5s_window_value
08:00:01 A 100
08:00:01 B 200
08:00:02 B 202.5
08:00:04 A 97.5
08:00:06 C 300
08:00:07 A 92.5

I can achieve this in small batches of data by joining the table against itself, however this a very inefficient solution, and I know that the correct way to achieve this would have some kind of deque under the hood. The query for this looks like the following:

WITH lhs AS (
    SELECT 
        ts, instrumentId, value, rank() OVER (ORDER BY ts)
    FROM prices
),
splay AS (
    SELECT
        lhs.instrumentId, lhs.ts, lhs.rank, rhs.value
    FROM lhs as rhs
    JOIN lhs
    ON 
        lhs.instrumentId=rhs.instrumentId
        AND rhs.ts BETWEEN lhs.ts - INTERVAL '5s' AND lhs.ts
    ORDER BY lhs.instrumentId, lhs.rank, rhs.rank
)
SELECT
    MAX(instrumentId), min(ts), AVG(value)
FROM splay
GROUP BY rank;

The above takes ~2 minutes to run for a single day of data, but I can do the operation in pandas on 1 year of data in ~45 seconds, so I'm confident there's a better way to do it in SQL.

How can I achieve efficient grouped window functions in timescaledb/postgres?

I think what you're looking for is time_bucket with a group by, so something like:

SELECT time_bucket('5s'::interval, ts), instrumentID, avg(value)
FROM prices
GROUP BY time_bucket('5s'::interval, ts), instrumentID;

That'll group by the timestamp at the beginning of the range, if you want the end, you can just add '5s' to it...

If you're trying to do this for each row in your original dataset, then you'd use a window function with a partition by clause and range clause like so:

SELECT *, avg(value) OVER (PARTITION BY instrumentId ORDER BY ts RANGE '5s' PRECEDING)
FROM prices; 

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM