
Postgresql: create a query that uses generate_series with an interval that correctly takes DST changes into account and flattens on true calendar days

As a follow-up to a comment on my question "Is this query that tries to get timeseries statuses with truncated dates even possible in regular relational databases?", I have implemented a time series query on Postgres that works reasonably well. It flattens time on whole periods (like days) and joins the result with some data.

There is a major problem with it, though. The query is timezone-dependent, which works fine, but when a daylight saving time (DST) change happens in the middle of the generated series, this is not reflected in the output. In some timezones one day of the year lasts only 23 hours and another day lasts 25 hours. I need the data to be aggregated over this 23- or 25-hour period, because those are the true calendar days in that timezone. The current query, however, always adds a fixed 1 day to the series. This means that across a DST switch, I get output like:

date 1: 00:00
date 2: 00:00
date 3: 00:00
(now a DST change happens)
date 3: 23:00
date 4: 23:00
... and so on

I'm at a loss on how to rewrite this query to take into account that certain days last fewer or more hours in some timezones, because generate_series is based on intervals. Any ideas? The actual code has an arbitrary period and amount, by the way; it could also be 5 months or 3 hours.

Here's the full query, though I imagine only the sub1 query is relevant.

SELECT sub2.fromdate,
       sub2.eventlevel,
       sub2.count
FROM
  (SELECT sub1.fromdate AS fromdate,
          sub1.maxeventlevel AS eventlevel,
          count(*) AS COUNT
   FROM
     (SELECT e.subject_id,
             MAX(e.event_level) AS maxeventlevel,
             d.date AS fromdate
      FROM
        (SELECT generate_series(date_trunc(?, ? AT TIME ZONE ?) AT TIME ZONE ?, date_trunc(?, ? AT TIME ZONE ?) AT TIME ZONE ? , interval '1' DAY)) d(date)
      INNER JOIN event e ON ((e.end_date > d.date
                              AND e.end_date > ?)
                             OR e.end_date IS NULL)
      AND e.date < d.date + interval '1' DAY
      AND e.date < ?
      AND d.date < ?
      INNER JOIN subject ON subject.id = e.subject_id
      INNER JOIN metric ON metric.id = e.metric_id
      INNER JOIN event_configuration_version ON event_configuration_version.id = e.event_configuration_version_id
      INNER JOIN event_configuration ON event_configuration.id = event_configuration_version.event_configuration_id
      WHERE subject.project_id = ?
      GROUP BY e.subject_id,
               fromdate) AS sub1
   GROUP BY sub1.fromdate,
            sub1.maxeventlevel) AS sub2
ORDER BY sub2.fromdate,
         sub2.eventlevel DESC

I don't think I can do anything in code after the query has already been performed, but I'm open to any code solutions that I've missed, though ideally we get correct results back from the SQL query itself. We do need to do most of the aggregation in the database, but if there's something smart that can be done elsewhere, that works too. The Java code that generates and executes this query and transforms the result runs in a Spring Boot application and looks as follows:

public PeriodAggregationDTO[] getSubjectStatesReport(
    AggregationPeriod aggregationPeriod, Integer aggregationPeriodAmount, UUID projectId,
    List<UUID> eventTriggerIds, List<UUID> subjectIds, List<UUID> metricIds, List<EventLevel> eventLevels,
    Date fromDate, Date toDate) {
    // to avoid an even more complex native query, we obtain the project here so a) we are sure
    // that this user has access
    // and b) we can get the timezone already without additional joins later.

    Project project = serviceUtil.findProjectByIdOrThrowApiException(projectId);
    String timezoneId = project.getTimezoneId();

    boolean skipEventTriggers = eventTriggerIds == null || eventTriggerIds.size() == 0;
    boolean skipSubjects = subjectIds == null || subjectIds.size() == 0;
    boolean skipMetrics = metricIds == null || metricIds.size() == 0;
    boolean skipEventLevels = eventLevels == null || eventLevels.size() == 0;

    StringBuilder whereClause = new StringBuilder();
    whereClause.append(" WHERE subject.project_id = :projectId");
    if (!skipEventTriggers) {
        whereClause.append(" AND event_trigger.id in :eventTriggerIds");
    }
    if (!skipSubjects) {
        whereClause.append(" AND subject_id in :subjectIds");
    }
    if (!skipMetrics) {
        whereClause.append(" AND metric.id in :metricIds");
    }
    if (!skipEventLevels) {
        whereClause.append(" AND e.event_level in :eventLevels");
    }

    String interval = String.format("'%d' %s", aggregationPeriodAmount, aggregationPeriod);

    String series = "SELECT generate_series("
        + "date_trunc(:period, :fromDate AT TIME ZONE :timezoneId) AT TIME ZONE :timezoneId"
        + " , date_trunc(:period, :toDate AT TIME ZONE :timezoneId) AT TIME ZONE :timezoneId"
        + " , interval " + interval + ")";

    String innersubquery = "SELECT e.subject_id" + ",MAX(e.event_level) as maxeventlevel"
        + ",d.date as fromdate"
        + " FROM (" + series + " ) d(date)"
        + " INNER JOIN event e ON ((e.end_date > d.date AND e.end_date > :fromDate)"
        + " OR e.end_date IS NULL) AND e.date < d.date + interval " + interval
        + " AND e.date < :toDate AND d.date < :toDate"
        + " INNER JOIN subject ON subject.id = e.subject_id"
        + " INNER JOIN metric ON metric.id = e.metric_id"
        + " INNER JOIN event_trigger_version ON event_trigger_version.id = e.event_trigger_version_id"
        + " INNER JOIN event_trigger ON event_trigger.id = event_trigger_version.event_trigger_id"
        + whereClause.toString()
        + " GROUP BY e.subject_id, fromdate";

    String outersubquery = "SELECT" + " sub1.fromdate as fromdate"
        + ",sub1.maxeventlevel as eventlevel" + ",count(*) as count" + " FROM"
        + " (" + innersubquery + ") AS sub1"
        + " GROUP BY sub1.fromdate, sub1.maxeventlevel";

    String queryString = "SELECT sub2.fromdate, sub2.eventlevel, sub2.count FROM ("
        + outersubquery + ") AS sub2"
        + " ORDER BY sub2.fromdate, sub2.eventlevel DESC";

    Query query = em.createNativeQuery(queryString);

    query.setParameter("projectId", projectId);
    query.setParameter("timezoneId", timezoneId);
    query.setParameter("period", aggregationPeriod.toString());
    query.setParameter("fromDate", fromDate);
    query.setParameter("toDate", toDate);
    if (!skipEventTriggers) {
        query.setParameter("eventTriggerIds", eventTriggerIds);
    }
    if (!skipSubjects) {
        query.setParameter("subjectIds", subjectIds);
    }
    if (!skipMetrics) {
        query.setParameter("metricIds", metricIds);
    }
    if (!skipEventLevels) {
        List<Integer> eventLevelOrdinals =
            eventLevels.stream().map(Enum::ordinal).collect(Collectors.toList());
        query.setParameter("eventLevels", eventLevelOrdinals);
    }

    List<?> resultList = query.getResultList();

    Stream<AggregateQueryEntity> stream = resultList.stream().map(obj -> {
        Object[] array = (Object[]) obj;
        Timestamp timestamp = (Timestamp) array[0];
        Integer eventLevelOrdinal = (Integer) array[1];
        EventLevel eventLevel = EventLevel.values()[eventLevelOrdinal];
        BigInteger count = (BigInteger) array[2];
        return new AggregateQueryEntity(timestamp, eventLevel, count.longValue());
    });
    return transformQueryResult(stream);
}

private PeriodAggregationDTO[] transformQueryResult(Stream<AggregateQueryEntity> stream) {
    // we specifically use LinkedHashMap to maintain ordering. We also set Linkedlist explicitly
    // because there are no guarantees for this list type with toList()
    Map<Timestamp, List<AggregateQueryEntity>> aggregatesByDate = stream
        .collect(Collectors.groupingBy(AggregateQueryEntity::getTimestamp,
            LinkedHashMap::new, Collectors.toCollection(LinkedList::new)));

    return aggregatesByDate.entrySet().stream().map(entryByDate -> {
        PeriodAggregationDTO dto = new PeriodAggregationDTO();
        dto.setFromDate((Date.from(entryByDate.getKey().toInstant())));
        List<AggregateQueryEntity> value = entryByDate.getValue();
        List<EventLevelAggregationDTO> eventLevelAggregationDTOS = getAggregatesByEventLevel(value);
        dto.setEventLevels(eventLevelAggregationDTOS);
        return dto;
    }).toArray(PeriodAggregationDTO[]::new);
}

private List<EventLevelAggregationDTO> getAggregatesByEventLevel(
    List<AggregateQueryEntity> value) {
    Map<EventLevel, AggregateQueryEntity> aggregatesByEventLevel = value.stream()
        .collect(Collectors.toMap(AggregateQueryEntity::getEventLevel, Function.identity(), (u, v) -> {
            throw new InternalException(String.format("Unexpected duplicate event level %s", u));
        }, LinkedHashMap::new));
    return aggregatesByEventLevel.values().stream().map(aggregateQueryEntity -> {
        EventLevelAggregationDTO eventLevelAggregationDTO = new EventLevelAggregationDTO();
        eventLevelAggregationDTO.setEventLevel(aggregateQueryEntity.getEventLevel());
        eventLevelAggregationDTO.setCount(aggregateQueryEntity.getCount());
        return eventLevelAggregationDTO;
    }).collect(Collectors.toCollection(LinkedList::new));
}

With another data class:

@Data
class AggregateQueryEntity {

    private final Timestamp timestamp;
    private final EventLevel eventLevel;
    private final long count;
}

A simple enough solution is to patch the result in Java code rather than getting it right in SQL directly. I'm not saying the latter is impossible, but it may be rather complicated. Below is Java code that you can patch in. As with a simple query, get the date, time and timezone from the SQL result regardless of the timezone difference.

date 1: 00:00
date 2: 00:00
date 3: 00:00
(now a DST change happens)
date 3: 23:00
date 4: 23:00

For example, in your case the DST change takes place between date 3 and date 4. Treat date 3 as oldDate and date 4 as newDate in the Java code below. Step 1: retrieve the timezone offset of both dates with newDate.getTimezoneOffset() and oldDate.getTimezoneOffset().

import java.util.Date;
import java.util.TimeZone;

// Uses the JVM default timezone; this must be the timezone the series is aligned to.
private final TimeZone timezone = TimeZone.getDefault();

private Date alignToPreviousDate(Date oldDate, Date newDate) {
    // Compare the two offsets: if they differ, a DST change took place in between,
    // e.g. GMT versus BST (+1). The calculation is only done in that case.
    // Note: Date.getTimezoneOffset() is deprecated and also uses the JVM default timezone.
    if (oldDate.getTimezoneOffset() != newDate.getTimezoneOffset()) {
        // save the time to modify it later on
        final long newTime = newDate.getTime();
        // time difference caused by the DST change
        long timediff = checkTimeZoneDiff(oldDate, newDate);

        // update newDate (date 4) based on the difference found
        newDate = new Date(newTime + timediff);
    }
    return newDate;
}

private long checkTimeZoneDiff(Date oldDate, Date newDate) {
    if (timezone.inDaylightTime(oldDate)) {
        // DST ended between the two dates: add the DST saving (usually 1 hour)
        return timezone.getDSTSavings();
    } else if (timezone.inDaylightTime(newDate)) {
        // DST started between the two dates: subtract the DST saving (usually 1 hour)
        return -timezone.getDSTSavings();
    } else {
        return 0;
    }
}

Hope that makes sense: you add timediff to newDate (date 4), and then continue the same process for every following date. See the bubble sort algorithm for how to compare adjacent values in such a sequence.
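The same offset comparison can also be sketched with java.time, which takes the zone explicitly instead of relying on the JVM default. This is only a minimal illustration: the class name DstShift, the method correctionSeconds and the Europe/Vienna sample instants are hypothetical and not part of the original code.

```java
import java.time.Duration;
import java.time.Instant;
import java.time.ZoneId;
import java.time.zone.ZoneRules;

final class DstShift {

    // Seconds to add to 'next' so that it shows the same local wall-clock time
    // as 'previous' in the given zone. Returns 0 when both instants have the
    // same UTC offset, i.e. no DST change happened in between.
    static long correctionSeconds(Instant previous, Instant next, ZoneId zone) {
        ZoneRules rules = zone.getRules();
        int offsetBefore = rules.getOffset(previous).getTotalSeconds();
        int offsetAfter = rules.getOffset(next).getTotalSeconds();
        return offsetBefore - offsetAfter;
    }

    public static void main(String[] args) {
        ZoneId zone = ZoneId.of("Europe/Vienna"); // assumed example zone
        // Local midnight on 2019-03-31 in Vienna (UTC+1), the day DST starts.
        Instant previous = Instant.parse("2019-03-30T23:00:00Z");
        // A fixed 24-hour step, as produced by the uncorrected series.
        Instant next = previous.plus(Duration.ofHours(24));

        long correction = correctionSeconds(previous, next, zone);
        Instant corrected = next.plusSeconds(correction);

        System.out.println("uncorrected: " + next.atZone(zone));
        System.out.println("corrected:   " + corrected.atZone(zone));
    }
}
```

In this sample the fixed 24-hour step lands one hour past local midnight, and the correction of minus one hour moves it back to the start of the calendar day.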

If you use timestamp with time zone, it should work just as you expect, because adding 1 day will sometimes add 23 or 25 hours:

SHOW timezone;

   TimeZone    
---------------
 Europe/Vienna
(1 row)

SELECT * from generate_series(
                 TIMESTAMP WITH TIME ZONE '2019-03-28',
                 TIMESTAMP WITH TIME ZONE '2019-04-05',
                 INTERVAL '1' DAY
              );

    generate_series     
------------------------
 2019-03-28 00:00:00+01
 2019-03-29 00:00:00+01
 2019-03-30 00:00:00+01
 2019-03-31 00:00:00+01
 2019-04-01 00:00:00+02
 2019-04-02 00:00:00+02
 2019-04-03 00:00:00+02
 2019-04-04 00:00:00+02
 2019-04-05 00:00:00+02
(9 rows)

As you can see, this hinges on the current setting of timezone, which is respected by the date arithmetic performed by generate_series.

If you want to use this, you'll have to adjust the parameter for each query. Fortunately this is not difficult:

BEGIN;  -- a transaction
SET LOCAL timezone = 'whatever';  -- for the transaction only
SELECT /* your query */;
COMMIT;
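For comparison, the calendar-day arithmetic that generate_series performs on timestamp with time zone can be mirrored on the Java side with java.time. This is a rough sketch for checking expectations, not a replacement for the query; the class name CalendarDaySeries and the method dayStarts are made up for illustration, and the dates are the ones from the example output above.

```java
import java.time.Duration;
import java.time.LocalDate;
import java.time.ZoneId;
import java.time.ZonedDateTime;
import java.util.ArrayList;
import java.util.List;

final class CalendarDaySeries {

    // Start of every calendar day from 'from' to 'to' (both inclusive) in 'zone'.
    // Stepping over local dates keeps each boundary on the start of the local day,
    // so a day that contains a DST change is 23 or 25 hours long.
    static List<ZonedDateTime> dayStarts(LocalDate from, LocalDate to, ZoneId zone) {
        List<ZonedDateTime> starts = new ArrayList<>();
        for (LocalDate day = from; !day.isAfter(to); day = day.plusDays(1)) {
            starts.add(day.atStartOfDay(zone));
        }
        return starts;
    }

    public static void main(String[] args) {
        ZoneId vienna = ZoneId.of("Europe/Vienna");
        List<ZonedDateTime> starts =
            dayStarts(LocalDate.of(2019, 3, 28), LocalDate.of(2019, 4, 5), vienna);

        // Print each day start together with the real length of that day.
        for (int i = 0; i + 1 < starts.size(); i++) {
            long hours = Duration.between(starts.get(i), starts.get(i + 1)).toHours();
            System.out.println(starts.get(i).toOffsetDateTime() + " lasts " + hours + " h");
        }
    }
}
```

With these inputs the day starting on 2019-03-31 is 23 hours long and every other day is 24 hours long, matching the offset change from +01 to +02 in the generate_series output.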
