
SQL: calculate monthly averages from arbitrary intervals

I have a log table that stores events in the form of

timestamp,        object_id, state
2018-08-12 13:45  123        10
2018-08-13 15:56  183        25
2018-08-13 15:58  123        10
2018-08-15 16:02  256        15

There is a primary key (not included for brevity), the timestamp is a datetime field, object_id is a foreign key to a different table, and state is an integer in the 0-100 range. Events are recorded as they come in, and the state doesn't necessarily change between events, so the same object_id might have multiple consecutive records with the same state.

The database is PostgreSQL 9.5.

What I am trying to do is calculate the average state over monthly, daily and weekly intervals, for individual objects or for objects selected by some criteria. For daily averages, the results I expect should look something like

date,        object_id, average state
2018-08-12   123        18.6
2018-08-13   123        37.1
2018-08-14   123        126.7
2018-08-15   123        5.5

where the average state is weighted by the amount of time the object spent in each state during the interval (one day in the case above), measured in one-minute steps. So if an object spends 23 hours and 45 minutes in state 10 and 15 minutes in state 50, the average should be

15/1440 * 50 + 1425/1440 * 10 = 10.42
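The arithmetic above can be checked with a few lines of Python. This is only an illustrative sketch: the function name `time_weighted_average` and the `(minutes, state)` pair layout are assumptions made for this example, not part of the table or the query.

```python
def time_weighted_average(segments):
    """Average state weighted by minutes spent in each state.

    segments: list of (minutes, state) pairs covering the interval.
    """
    total_minutes = sum(minutes for minutes, _ in segments)
    weighted_sum = sum(minutes * state for minutes, state in segments)
    return weighted_sum / total_minutes


# One day is 1440 minutes: 15 minutes in state 50, the other 1425 in state 10.
day = [(15, 50), (1425, 10)]
print(round(time_weighted_average(day), 2))  # 10.42
```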

So far, I have managed to use window functions to convert individual events into intervals between state changes. The SQL looks something like this:

SELECT
    state.object_id,
    state.timestamp as start, 
    lead(timestamp) OVER (ORDER BY timestamp) as end,
    state.state
FROM 
(
    SELECT 
        *, 
        rank() OVER (PARTITION BY (state) ORDER BY timestamp)
    FROM event_log AS l
    WHERE object_id=123 AND timestamp >= DATE '2018-01-01'
) AS state
WHERE state.rank=1
ORDER BY timestamp

This gives me the start and end of each interval in which the state actually changes, but I am not sure where to go from here. Events do not always arrive frequently, so a single interval might last three days, and I need to report it on a day-by-day basis. That means splitting the interval into days. How do I go about this the right way?
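To make the splitting step concrete, here is a rough Python sketch of cutting one long interval at every midnight. It is not SQL and not a definitive solution. The helper name `split_by_day` is hypothetical, and the sketch assumes naive timestamps (no time zone) and a half-open interval, start inclusive and end exclusive.

```python
from datetime import datetime, timedelta


def split_by_day(start, end):
    """Yield (day, piece_start, piece_end) for each calendar day the interval touches."""
    cursor = start
    while cursor < end:
        # Midnight that ends the day the cursor is currently in.
        next_midnight = datetime.combine(cursor.date(), datetime.min.time()) + timedelta(days=1)
        piece_end = min(end, next_midnight)
        yield cursor.date(), cursor, piece_end
        cursor = piece_end


# An interval lasting a little over three days, cut into per-day pieces.
pieces = list(split_by_day(datetime(2018, 8, 12, 13, 45), datetime(2018, 8, 15, 16, 2)))
for day, piece_start, piece_end in pieces:
    minutes = int((piece_end - piece_start).total_seconds() // 60)
    print(day, minutes)
```

Each piece then contributes its minutes, multiplied by the state, to the weighted average of its own day.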

Well, one method to calculate that average would be to actually unroll all the minutes using generate_series(), assign a state to each of them with a subquery, and then GROUP BY object ID and day.

SELECT date_trunc('day',
                  "gs"."timestamp") "date",
       "x1"."object_id",
       avg((SELECT "el1"."state"
                   FROM "event_log" "el1"
                   WHERE "el1"."object_id" = "x1"."object_id"
                         AND "el1"."timestamp" <= "gs"."timestamp"
                   ORDER BY "el1"."timestamp" DESC
                   LIMIT 1)) "state"
       FROM (SELECT "el1"."object_id",
                    min(date_trunc('minute',
                                   "el1"."timestamp")) "timestamp_begin",
                    max(date_trunc('minute',
                                   "el1"."timestamp")) "timestamp_end"
                    FROM "event_log" "el1"
                    GROUP BY "el1"."object_id") "x1"
             CROSS JOIN LATERAL generate_series("x1"."timestamp_begin",
                                                "x1"."timestamp_end",
                                                '1 minute'::interval) "gs"("timestamp")
       GROUP BY date_trunc('day',
                           "gs"."timestamp"),
                "x1"."object_id"
       ORDER BY date_trunc('day',
                           "gs"."timestamp"),
                "x1"."object_id";


Result:

date                | object_id |               state
:------------------ | --------: | ------------------:
2018-08-12 00:00:00 |       123 | 10.0000000000000000
2018-08-13 00:00:00 |       123 | 10.0000000000000000
2018-08-13 00:00:00 |       183 | 25.0000000000000000
2018-08-15 00:00:00 |       256 | 15.0000000000000000

The idea is to generate every minute between the first and last timestamp of an object, and then assign to each minute the latest state that was logged at or before that minute.

Once we have a state for each minute, getting the averages per day and object is a more or less simple aggregation query.

First, the subquery aliased "x1" gets the first and last timestamp for each object. We use date_trunc() to truncate these timestamps to minute precision.

We cross join "x1" laterally with generate_series() and feed it the first and last minute. This generates one timestamp per minute from the first to the last one.

Now, in the subquery in the avg() call, we select all rows where the object is the same as in the current row of the outer query and the timestamp is less than or equal to that of the current row. We only want the latest of these, so we sort them by timestamp in descending order and pick just the first one.

Finally, we use date_trunc() again, this time to truncate the minutes to days, and group by the day and the object.
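For readers who find the steps easier to follow outside SQL, the same logic can be sketched in Python for a single object. This is an illustration of the technique, not a replacement for the query. The function name `daily_averages` is hypothetical, and the sketch assumes the events are sorted by timestamp and already fall on whole minutes, as in the sample data.

```python
from bisect import bisect_right
from collections import defaultdict
from datetime import datetime, timedelta


def daily_averages(events):
    """events: list of (timestamp, state) for ONE object, sorted by timestamp."""
    times = [ts for ts, _ in events]
    states = [state for _, state in events]
    sums = defaultdict(lambda: [0, 0])  # day -> [sum of states, number of minutes]
    minute = times[0]
    while minute <= times[-1]:  # plays the role of generate_series()
        # Latest event logged at or before this minute (the ORDER BY ... DESC LIMIT 1 step).
        current = states[bisect_right(times, minute) - 1]
        bucket = sums[minute.date()]
        bucket[0] += current
        bucket[1] += 1
        minute += timedelta(minutes=1)
    # Grouping by day and averaging (the GROUP BY date_trunc('day', ...) step).
    return {day: total / count for day, (total, count) in sums.items()}


# The two sample rows for object 123 from the question.
events_123 = [(datetime(2018, 8, 12, 13, 45), 10), (datetime(2018, 8, 13, 15, 58), 10)]
print(daily_averages(events_123))
```

For object 123 this gives 10.0 for both 2018-08-12 and 2018-08-13, matching the corresponding rows of the result above. As in the query, minutes after an object's last event are not counted.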
