
Select no more than 1 row every N seconds

I have a couple of MySQL tables storing temperature data from sensors. The sensors report about once every minute, and there are dozens of sensors (and growing). The tables have quickly grown to millions of rows, and will keep growing. The two pertinent tables are data and data_temperature.

The data table is structured like this:

data_id bigint(20) unsigned NOT NULL AUTO_INCREMENT
created timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP
sensor_id int(10) unsigned NOT NULL

The data_temperature table is structured like this:

temperature_id bigint(20) unsigned NOT NULL AUTO_INCREMENT
created timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP
data_id bigint(20) unsigned NOT NULL
x_id varchar(32) DEFAULT NULL
x_sn varchar(16) DEFAULT NULL
x_unit char(1) DEFAULT NULL
x_value decimal(6,2) DEFAULT NULL

Since each sensor reports about once per minute, there should be about 1440 rows per day for each sensor. But there are occasionally gaps in the data, sometimes lasting minutes, and sometimes lasting much longer.

I need to select a sampling of the data to display on a graph. The graphs are 600 pixels wide. While the time frames of the graphs are variable (sometimes a daily graph, sometimes weekly, sometimes annually, etc.), the pixel widths of the graphs are fixed.

Originally I would select a count of the rows within the time frame, divide that by 600 to get X, then select the rows where data_id MOD X = 0. But this doesn't work well unless only one sensor is reporting to the table; with many sensors, it creates lots of gaps. To compensate, I'm pulling much more data than needed and overpopulating the graphs to be sure there are no holes.

The overpopulating causes slow render times in the browser. But even the SELECT COUNT() is now the major cause of the server-side slowness; it takes about 5-6 seconds to run on the data table.

Ideally, what I'd like to do is select the data from the table such that I have no more than one data point (but zero is okay, in case there is no data) in a given window. The window is the total time frame being viewed in the graph divided by the width of the graph in pixels. So viewing a daily graph that's 600px wide would be calculated like this:

86400 seconds per day / 600 pixels = 144-second window

So I would want no more than one data point every 144 seconds. Here's the query I've come up with so far:

SELECT data_temperature.data_id, data_temperature.created,
       ROUND( data_temperature.x_value, 1 ) AS temperature
  FROM data_temperature
         INNER JOIN data
                 ON data_temperature.data_id = data.data_id
 WHERE data.sensor_id = :sensor_id
   AND data.created BETWEEN :dt_start AND :dt_end
 GROUP BY ROUND( UNIX_TIMESTAMP( data_temperature.created ) / 144 )
 ORDER BY data.created, data.data_id
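The hard-coded 144 can be generalized so the same query serves any graph width and time frame. A sketch, assuming :window_seconds is computed in application code as (time frame in seconds) / (graph width in pixels):

```sql
-- Sketch: parameterized bucket width instead of a hard-coded 144.
-- :window_seconds = seconds in the viewed time frame / 600 pixels
SELECT data_temperature.data_id, data_temperature.created,
       ROUND( data_temperature.x_value, 1 ) AS temperature
  FROM data_temperature
         INNER JOIN data
                 ON data_temperature.data_id = data.data_id
 WHERE data.sensor_id = :sensor_id
   AND data.created BETWEEN :dt_start AND :dt_end
 GROUP BY FLOOR( UNIX_TIMESTAMP( data_temperature.created ) / :window_seconds )
 ORDER BY data.created, data.data_id
```

FLOOR is used here rather than ROUND so each bucket covers exactly :window_seconds; with ROUND, the first and last buckets are half-width.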

This query is an improvement both in that it returns the correct data and in that it runs in about 3.6 seconds. That's still much slower than what I really want, so I'm wondering if there are any other thoughts on accomplishing this with a more efficient query.

Note: Even though it doesn't look right, there's a good reason for having the data and data_temperature tables separated even though their relationship is 1-to-1. When I modify my queries and structure so that everything is in a single table, it doesn't improve the query time, so I don't believe having two tables is negatively impacting performance.

Update to clarify, based on @Kevin Nelson's response

It's not the GROUP BY that's slow; it's the BETWEEN in the WHERE clause. If I remove that, the query runs much faster, but of course returns the wrong results. Even if I execute a simple query like this:

SELECT data.data_id, data.created
  FROM data
 WHERE data.created BETWEEN :dt_start AND :dt_end

It's also very slow. My created column is indexed, so I'm not particularly sure why. I do know that the greater the range between dt_start and dt_end, the slower it runs. A one-day range takes about half a second; a one-week range takes about 10 seconds.

I apologize if I got the overall question wrong, but it sounds like you are asking how to optimize the table for the best speed when selecting the rows, because the GROUP BY you are using should be working, from all I can see. If your WHERE condition is against indexed columns, then the GROUP BY shouldn't be slowing it down noticeably.

However, there are some things you can do to potentially speed up the table queries:

1) With an InnoDB table, make the primary key a combination of sensor_id and created: PRIMARY KEY (created, sensor_id). InnoDB uses a clustered index for the primary key, so it doesn't have to search the index and then go find the data separately. However, if possible, you want to make sure to insert rows in primary-key order so that each new row can simply be appended at the end.
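For illustration, redefining the primary key on the question's data table might look like the following. This is a sketch, not tested against the original schema; data_id is appended to the primary key as an assumption, to keep the key unique if a sensor ever reports twice in the same second, and it must also keep an index of its own because it is AUTO_INCREMENT:

```sql
-- Sketch: make the clustered index order rows by time, then sensor.
-- data_id is kept in the primary key for uniqueness, and keeps its
-- own unique index because AUTO_INCREMENT columns must be indexed.
ALTER TABLE data
  DROP PRIMARY KEY,
  ADD PRIMARY KEY (created, sensor_id, data_id),
  ADD UNIQUE KEY byDataId (data_id);
```

On a multi-million-row table this ALTER rewrites the whole table, so it should be run during a maintenance window.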

2) Use table partitioning. Making a partition per month (or some other measure of time) will create separate files that can be searched independently. You just have to make sure to use the partitioning column in the WHERE clause, or it will have to search every file.

http://dev.mysql.com/doc/refman/5.6/en/partitioning.html
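A monthly RANGE partitioning scheme for the data table could be sketched like this. Partition names and boundaries are illustrative, and note that on a partitioned InnoDB table every unique key, including the primary key, must contain the partitioning column (created):

```sql
-- Sketch: one partition per month on the `created` timestamp column.
-- UNIX_TIMESTAMP() is one of the partitioning functions MySQL allows
-- on TIMESTAMP columns.
ALTER TABLE data
  PARTITION BY RANGE ( UNIX_TIMESTAMP(created) ) (
    PARTITION p201401 VALUES LESS THAN ( UNIX_TIMESTAMP('2014-02-01 00:00:00') ),
    PARTITION p201402 VALUES LESS THAN ( UNIX_TIMESTAMP('2014-03-01 00:00:00') ),
    PARTITION pmax    VALUES LESS THAN MAXVALUE
  );
```

New monthly partitions would then be split off the pmax catch-all (ALTER TABLE ... REORGANIZE PARTITION) as time goes on.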

[UPDATE BASED ON COMMENT AND Q UPDATE]

Trust me, I understand your model better than you think; I'm in pretty much the same business. For my current job, we have about 70 million records per month for our thermostats, and growing rapidly (we only capture data every 5 minutes rather than every minute), with over 1 billion records overall. Partitioning (either manually or with MySQL's built-in partitioning) breaks the months into their own files, so any given search only has to go through the given month's data rather than the entire DB. So I'm not sure why you would think that partitioning isn't scalable. The whole point of partitioning is scalability.

The only other idea I've tossed around is a NoSQL file per sensor per month, which might be the ultimate in speed, but I don't know enough about NoSQL yet to know all of the ins and outs.

But in any case, for MySQL, using the 70 million records I mentioned on an InnoDB table with the primary key (macAddress, timestamp), grabbing 2 days' worth of entries (576 records) takes 0.140 seconds; my local machine, which is a much slower machine, takes only 0.187 seconds for the same query. As I mentioned, because the primary key is a clustered index, it is stored WITH the data, so the rows are physically ordered by (mac, timestamp): when it finds the index, it has found the data. With a standard MySQL index, the server has to find the index entry that points to the data, then go fetch the data separately, which increases the time.

If you are using MySQL Workbench, I believe this is the difference between Duration and Fetch. If you see a high duration, the server is struggling to find the data. If you see a low duration and a high fetch time, then (I think, but am not totally certain) it's finding the data's index quickly, but fetching is taking time as it chases all those pointer locations. When I search on the clustered index, my fetch time is 0.031 seconds.

Regardless of whether you use a clustered index as suggested, in the end you need to run EXPLAIN SELECT... on your query and make sure it's actually using the indexes you expect. If not, you need to find out why. At the least, if you don't have it, I would create the index:

INDEX bySensorAndTime (sensor_id, created)

This way, MySQL only needs to use one index for your query, since (I'm guessing) you would always search with both of those fields in the WHERE clause.
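To confirm the index is actually chosen, run EXPLAIN with representative parameters. The sensor id and date range below are placeholders:

```sql
-- Sketch: check which index the optimizer picks for the range query.
EXPLAIN SELECT data.data_id, data.created
  FROM data
 WHERE data.sensor_id = 1
   AND data.created BETWEEN '2014-01-01 00:00:00' AND '2014-01-08 00:00:00';
-- The `key` column of the output should name bySensorAndTime
-- (or the primary key, if it was redefined to lead with these columns);
-- if it shows NULL, the query is doing a full scan.
```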
