Method of finding gaps in time series data in MySQL?

Let's say we have a database table with two columns, entry_time and value. entry_time is a timestamp, while value can be any other datatype. The records are relatively consistent, entered at roughly x-minute intervals. For many multiples of x, however, an entry may not be made, thus producing a 'gap' in the data.

In terms of efficiency, what is the best way to find these gaps of at least time Y (both new and old) with a query?

To start with, let us summarize the number of entries by hour in your table.

SELECT CAST(DATE_FORMAT(entry_time,'%Y-%m-%d %k:00:00') AS DATETIME) hour,
       COUNT(*) samplecount
  FROM table
 GROUP BY CAST(DATE_FORMAT(entry_time,'%Y-%m-%d %k:00:00') AS DATETIME)

Now, if you log something every six minutes (ten times an hour), all your samplecount values should be ten. The expression CAST(DATE_FORMAT(entry_time,'%Y-%m-%d %k:00:00') AS DATETIME) looks hairy, but it simply truncates your timestamps to the hour in which they occur by zeroing out the minutes and seconds.

This is reasonably efficient, and will get you started. It's very efficient if you can put an index on your entry_time column and restrict your query to, let's say, yesterday's samples, as shown here.

SELECT CAST(DATE_FORMAT(entry_time,'%Y-%m-%d %k:00:00') AS DATETIME) hour,
       COUNT(*) samplecount
  FROM table
 WHERE entry_time >= CURRENT_DATE - INTERVAL 1 DAY
   AND entry_time < CURRENT_DATE
 GROUP BY CAST(DATE_FORMAT(entry_time,'%Y-%m-%d %k:00:00') AS DATETIME)
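
To make the short hours stand out, a HAVING clause can keep only the hours whose count falls below the expected number. This is a minimal sketch, assuming the six-minute logging interval from the example above (ten samples per hour) and a hypothetical table name `samples`, since `table` itself is a reserved word in MySQL.

```sql
-- Hours from yesterday with fewer than the expected 10 samples.
-- `samples` is a hypothetical table name; 10 assumes one entry every six minutes.
SELECT CAST(DATE_FORMAT(entry_time, '%Y-%m-%d %H:00:00') AS DATETIME) AS sample_hour,
       COUNT(*) AS samplecount
  FROM samples
 WHERE entry_time >= CURRENT_DATE - INTERVAL 1 DAY
   AND entry_time <  CURRENT_DATE
 GROUP BY sample_hour
HAVING COUNT(*) < 10
 ORDER BY sample_hour;
```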

But it isn't much good at detecting whole hours that go by with missing samples: an hour with no rows at all simply doesn't appear in the GROUP BY output. It's also a little sensitive to jitter in your sampling. That is, if your top-of-the-hour sample is sometimes half a minute early (10:59:30) and sometimes half a minute late (11:00:30), your hourly summary counts will be off. So this hour summary thing (or day summary, or minute summary, etc.) is not bulletproof.

You need a self-join query to get stuff perfectly right; it's a bit more of a hairball and not nearly as efficient.

Let's start by creating ourselves a virtual table (subquery) like this with numbered samples. (This is a pain in MySQL; some other expensive DBMSs make it easier. No matter.)

  SELECT @sample:=@sample+1 AS entry_num, c.entry_time, c.value
    FROM (
        SELECT entry_time, value
          FROM table           /* placeholder: your table name */
         ORDER BY entry_time
    ) c,
    (SELECT @sample:=0) s

This little virtual table gives entry_num, entry_time, value.

Next step, we join it to itself.

SELECT one.entry_num, one.entry_time, one.value,
       TIMEDIFF(two.entry_time, one.entry_time) AS gap
  FROM (
     /* virtual table */
  ) one
  JOIN (
     /* same virtual table */
  ) two ON (two.entry_num - 1 = one.entry_num)

This lines up the two copies of the table next to each other, offset by a single entry, governed by the ON clause of the JOIN.

Finally we choose the rows from this result whose interval is larger than your threshold; their entry_time values are the times of the samples right before the missing ones.

The overall self-join query is this. I told you it was a hairball.

SELECT one.entry_num, one.entry_time, one.value,
       TIMEDIFF(two.entry_time, one.entry_time) AS gap
  FROM (
    SELECT @sample:=@sample+1 AS entry_num, c.entry_time, c.value
      FROM (
          SELECT entry_time, value
            FROM table
           ORDER BY entry_time
      ) c,
      (SELECT @sample:=0) s
  ) one
  JOIN (
    SELECT @sample2:=@sample2+1 AS entry_num, c.entry_time, c.value
      FROM (
          SELECT entry_time, value
            FROM table
           ORDER BY entry_time
      ) c,
      (SELECT @sample2:=0) s
  ) two ON (two.entry_num - 1 = one.entry_num)
 WHERE TIMESTAMPDIFF(MINUTE, one.entry_time, two.entry_time) > 10  /* example threshold; use your own Y */

If you have to do this in production on a large table you may want to do it for a subset of your data. For example, you could do it each day for the previous two days' samples. This would be decently efficient, and would also make sure you didn't overlook any missing samples right at midnight. To do this your little row-numbered virtual tables would look like this.

  SELECT @sample:=@sample+1 AS entry_num, c.entry_time, c.value
    FROM (
        SELECT entry_time, value
          FROM table
         WHERE entry_time >= CURRENT_DATE - INTERVAL 2 DAY
           AND entry_time < CURRENT_DATE /* the previous two days, but not today */
         ORDER BY entry_time
    ) c,
    (SELECT @sample:=0) s
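
On MySQL 8.0 or MariaDB 10.2 and later, window functions can pair each sample with its predecessor without row-numbering variables or a self-join (assigning user variables inside expressions is also deprecated in recent MySQL 8.0 releases). This is a different technique from the self-join above, shown as a rough sketch: `samples` is a hypothetical table name, and the 15-minute threshold is only an example value for Y.

```sql
-- Each row is paired with the previous entry_time via LAG();
-- rows whose distance to the previous sample exceeds the threshold are gaps.
SELECT prev_entry_time AS gap_begin,
       entry_time      AS gap_end,
       TIMESTAMPDIFF(SECOND, prev_entry_time, entry_time) AS gap_seconds
  FROM (
        SELECT entry_time,
               LAG(entry_time) OVER (ORDER BY entry_time) AS prev_entry_time
          FROM samples   -- hypothetical table name
         WHERE entry_time >= CURRENT_DATE - INTERVAL 2 DAY
           AND entry_time <  CURRENT_DATE
       ) AS paired
 WHERE TIMESTAMPDIFF(SECOND, prev_entry_time, entry_time) > 15 * 60   -- assumed threshold Y
 ORDER BY gap_begin;
```

The first row in the window has no predecessor, so its `prev_entry_time` is NULL and the WHERE clause drops it.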

A very efficient way to do this is with a stored procedure using cursors. I think this is simpler and more efficient than the other answers.

This procedure creates a cursor and iterates it through the datetime records that you are checking. Whenever the gap between two consecutive records is larger than what you specify, it writes the gap's begin and end to a table.

    CREATE PROCEDURE findgaps()
    BEGIN
    DECLARE done INT DEFAULT FALSE;
    DECLARE a,b DATETIME;
    DECLARE cur CURSOR FOR SELECT dateTimeCol FROM targetTable
                           ORDER BY dateTimeCol ASC;
    DECLARE CONTINUE HANDLER FOR NOT FOUND SET done = TRUE;
    OPEN cur;
    FETCH cur INTO a;
    read_loop: LOOP
        SET b = a;          -- b holds the previous row's timestamp
        FETCH cur INTO a;   -- a holds the current row's timestamp
        IF done THEN
            LEAVE read_loop;
        END IF;
        -- [range you specify] is a placeholder for a number of days, not SQL syntax
        IF DATEDIFF(a,b) > [range you specify] THEN
            INSERT INTO tmp_table (gap_begin, gap_end)
            VALUES (b,a);   -- the gap runs from the earlier row to the later one
        END IF;
    END LOOP;
    CLOSE cur;
    END;

In this case it is assumed that 'tmp_table' exists. You could easily define this as a TEMPORARY table in the procedure, but I left it out of this example.
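
For completeness, here is one way the scratch table could be set up and the procedure run. It is only a sketch: the DATETIME column types are an assumption chosen to match the procedure's variables, and the table is created in the session rather than inside the procedure.

```sql
-- Scratch table for the gaps; a TEMPORARY table disappears when the session ends.
CREATE TEMPORARY TABLE IF NOT EXISTS tmp_table (
    gap_begin DATETIME NOT NULL,   -- assumed type
    gap_end   DATETIME NOT NULL    -- assumed type
);

-- Run the procedure in the same session, then read back what it found.
CALL findgaps();

SELECT gap_begin, gap_end
  FROM tmp_table
 ORDER BY gap_begin;
```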

I'm trying this on MariaDB 10.3.27, so this procedure may not work as written, but I'm getting an error creating the procedure and I can't figure out why! I have a table called electric_use with a field Intervaldatetime DATETIME that I want to find gaps in. I created a target table electric_use_gaps with the fields gap_begin datetime and gap_end datetime.

The data are taken every hour and I want to know if I'm missing even an hour's worth of data across 5 years.

 DELIMITER $$  
  CREATE PROCEDURE findgaps()
    BEGIN    
    DECLARE done INT DEFAULT FALSE;
    DECLARE a,b DATETIME;
    DECLARE cur CURSOR FOR SELECT Intervaldatetime FROM electric_use
                           ORDER BY Intervaldatetime ASC;
    DECLARE CONTINUE HANDLER FOR NOT FOUND SET done = TRUE;     
    OPEN cur;       
    FETCH cur INTO a;       
    read_loop: LOOP
        SET b = a;
        FETCH cur INTO a;   
        IF done THEN
            LEAVE read_loop;
        END IF;     
        IF TIMESTAMPDIFF(MINUTE,a,b) > [60] THEN
            INSERT INTO electric_use_gaps(gap_begin, gap_end)
            VALUES (a,b);
        END IF;
    END LOOP;           
    CLOSE cur;      
    END&&
    
    DELIMITER ;

This is the error:

Query: CREATE PROCEDURE findgaps() BEGIN DECLARE done INT DEFAULT FALSE; DECLARE a,b DATETIME; DECLARE cur CURSOR FOR SELECT Intervalda...

Error Code: 1064
You have an error in your SQL syntax; check the manual that corresponds to your MariaDB server version for the right syntax to use near '[60] THEN
            INSERT INTO electric_use_gaps(gap_begin, gap_end)
   ...' at line 16
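
The parser is stopping at `[60]`: the square brackets in the answer's `[range you specify]` were a placeholder, not SQL syntax, so the threshold should be a bare number. Two other things look likely to bite next: the body ends with `END&&` although the delimiter was set to `$$`, and `TIMESTAMPDIFF(MINUTE, a, b)` puts the later timestamp first, which yields a negative value. A sketch of a version with those points changed, not tested against this exact MariaDB release:

```sql
DELIMITER $$
CREATE PROCEDURE findgaps()
BEGIN
    DECLARE done INT DEFAULT FALSE;
    DECLARE a, b DATETIME;
    DECLARE cur CURSOR FOR SELECT Intervaldatetime FROM electric_use
                           ORDER BY Intervaldatetime ASC;
    DECLARE CONTINUE HANDLER FOR NOT FOUND SET done = TRUE;
    OPEN cur;
    FETCH cur INTO a;
    read_loop: LOOP
        SET b = a;              -- b = previous reading
        FETCH cur INTO a;       -- a = current reading
        IF done THEN
            LEAVE read_loop;
        END IF;
        -- TIMESTAMPDIFF(unit, earlier, later) is positive; no brackets around 60
        IF TIMESTAMPDIFF(MINUTE, b, a) > 60 THEN
            INSERT INTO electric_use_gaps (gap_begin, gap_end)
            VALUES (b, a);      -- gap runs from the earlier reading to the later one
        END IF;
    END LOOP;
    CLOSE cur;
END$$
DELIMITER ;
```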
