Select no more than 1 row every N seconds

I have a couple of MySQL tables storing temperature data from sensors. The sensors report about once every minute, and there are dozens of sensors (and growing). The tables have quickly grown to millions of rows, and will keep growing. The two pertinent tables are data and data_temperature.

The data table is structured like this:

data_id bigint(20) unsigned NOT NULL AUTO_INCREMENT
created timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP
sensor_id int(10) unsigned NOT NULL

The data_temperature table is structured like this:

temperature_id bigint(20) unsigned NOT NULL AUTO_INCREMENT
created timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP
data_id bigint(20) unsigned NOT NULL
x_id varchar(32) DEFAULT NULL
x_sn varchar(16) DEFAULT NULL
x_unit char(1) DEFAULT NULL
x_value decimal(6,2) DEFAULT NULL

Since each sensor reports about once per minute, there should be about 1440 rows per day for each sensor. But there are occasionally gaps in the data, sometimes lasting minutes, and sometimes lasting much longer.

I need to select a sampling of the data to display on a graph. The graphs are 600 pixels wide. While the time-frames of the graphs are variable (sometimes a daily graph, sometimes weekly, sometimes annually, etc), the pixel widths of the graph are fixed.

Originally I would select a count of the rows within the time frame, then divide that by 600 to get X, then select the rows where data_id MOD X = 0. But this doesn't work well unless only one sensor is reporting to the table. With many sensors, it creates lots of gaps. To compensate, I'm pulling much more data than needed and overpopulating the graphs to be sure there are no holes due to this.
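
Roughly, that approach looked like this (a reconstructed sketch of what's described above; :dt_start and :dt_end are the end points of the graph's time frame):

-- Step 1: count the rows in the time frame and derive the sampling interval X
SELECT CEIL( COUNT(*) / 600 ) AS X
  FROM data
 WHERE created BETWEEN :dt_start AND :dt_end;

-- Step 2: keep every X-th row by data_id; with several sensors interleaved
-- in the table, consecutive data_id values belong to different sensors,
-- which is what creates the per-sensor gaps
SELECT data_id, created
  FROM data
 WHERE created BETWEEN :dt_start AND :dt_end
   AND data_id MOD :X = 0;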

The overpopulating causes slow render times in the browser. But even the SELECT COUNT() is now the major cause of the server-side slowness; it takes about 5-6 seconds to run on the data table.

Ideally, what I'd like to do is to select the data from the table such that I have no more than one data point (but zero is okay, in case there is no data) in a given window. The window is the total time frame being viewed in the graph divided by the width of the graph in pixels. So viewing a daily graph that's 600px wide would be calculated like this:

86400 seconds per day / 600 pixels = 144-second window

So I would want no more than one data point every 144 seconds. Here's the query that I've come up with so far:

SELECT data_temperature.data_id, data_temperature.created,
       ROUND( data_temperature.x_value, 1 ) AS temperature
  FROM data_temperature
         INNER JOIN data
                 ON data_temperature.data_id = data.data_id
 WHERE data.sensor_id = :sensor_id
   AND data.created BETWEEN :dt_start AND :dt_end
 GROUP BY ROUND( UNIX_TIMESTAMP( data_temperature.created ) / 144 )
 ORDER BY data.created, data.data_id

This query is an improvement, both in that it returns the correct data and in that it runs in about 3.6 seconds. That's still much slower than what I really want, so I'm wondering if there are any other thoughts on accomplishing this with a more efficient query.
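
For reference, the same idea can be written so the 144 becomes a :window parameter and each bucket returns a deterministic row (MIN(data_id) per bucket, then a join back to fetch that row; a bare GROUP BY otherwise lets MySQL return an arbitrary row from each group). This is just a sketch, not something I've benchmarked:

SELECT dt.data_id, dt.created,
       ROUND( dt.x_value, 1 ) AS temperature
  FROM ( SELECT MIN( data_temperature.data_id ) AS data_id
           FROM data_temperature
                  INNER JOIN data
                          ON data_temperature.data_id = data.data_id
          WHERE data.sensor_id = :sensor_id
            AND data.created BETWEEN :dt_start AND :dt_end
          GROUP BY FLOOR( UNIX_TIMESTAMP( data_temperature.created ) / :window )
       ) AS sample
         INNER JOIN data_temperature AS dt
                 ON dt.data_id = sample.data_id
 ORDER BY dt.created, dt.data_id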

Note: Even though it may not look right, there's a good reason for keeping the data and data_temperature tables separate despite their 1-to-1 relationship. When I modified my queries and structure so that everything was in a single table, the query time didn't improve, so I don't believe having two tables is negatively impacting performance.

Update to clarify based on @Kevin Nelson's response

It's not the GROUP BY that's slow; it's the BETWEEN in the WHERE clause. If I remove that, the query runs much faster, but of course returns the wrong results. If I execute a simple query like this:

SELECT data.data_id, data.created
  FROM data
 WHERE data.created BETWEEN :dt_start AND :dt_end

It's also very slow. My created column is indexed, so I'm not sure why. I do know that the greater the range between dt_start and dt_end, the longer it takes. A one-day range takes about half a second. A one-week range takes about 10 seconds.
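
For what it's worth, checking the plan with EXPLAIN (substituting literal datetimes for the placeholders; the one-week range here is illustrative) should show whether the index on created is actually used for the range scan:

EXPLAIN
SELECT data.data_id, data.created
  FROM data
 WHERE data.created BETWEEN '2014-01-01 00:00:00' AND '2014-01-08 00:00:00';

-- the key column should name the index on created; a NULL key or a very
-- large rows estimate means MySQL is scanning rather than using the index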

I apologize if I got the overall question wrong, but it sounds like you are asking how to optimize the table for the best speed when selecting rows, because from all I can see, the GROUP BY you are using should be working. If your WHERE condition is against indexed columns, the GROUP BY shouldn't be slowing it down noticeably.

However, there are some things that you can do to potentially speed up the table queries:

1) With an InnoDB table, make the primary key a combination of sensor_id and created: PRIMARY KEY (created, sensor_id). InnoDB uses a clustered index for the primary key, so it doesn't have to search the index and then go find the data separately. However, if possible, you want to insert rows in primary-key order so that InnoDB can simply append them at the end.
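
For example, a sketch of that change against the data table above (an assumption here: (created, sensor_id) is unique, i.e. no sensor reports twice in the same second):

ALTER TABLE data
  DROP PRIMARY KEY,
  ADD PRIMARY KEY (created, sensor_id),
  ADD UNIQUE KEY byDataId (data_id);

-- byDataId is needed because data_id keeps its AUTO_INCREMENT, and
-- MySQL requires an AUTO_INCREMENT column to lead some index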

2) Use table partitioning. Making a partition per month or some other measure of time creates separate files that can be searched independently. You just have to make sure to use the partitioning column in the WHERE clause, or MySQL will have to search every partition.

http://dev.mysql.com/doc/refman/5.6/en/partitioning.html
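
For example, monthly RANGE partitioning on created might look like the sketch below (boundaries are illustrative; note that MySQL only allows UNIX_TIMESTAMP() as the partitioning function on a TIMESTAMP column, and the partitioning column must appear in every unique key, so this assumes the primary key already includes created as in point 1):

ALTER TABLE data
  PARTITION BY RANGE ( UNIX_TIMESTAMP(created) ) (
    PARTITION p201401 VALUES LESS THAN ( UNIX_TIMESTAMP('2014-02-01 00:00:00') ),
    PARTITION p201402 VALUES LESS THAN ( UNIX_TIMESTAMP('2014-03-01 00:00:00') ),
    PARTITION pMax    VALUES LESS THAN MAXVALUE
  );

-- queries must filter on created for partition pruning to kick in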

[UPDATE BASED ON COMMENT AND Q UPDATE]

Trust me, I understand your model better than you think. I'm in pretty much the same business. For my current job, we have about 70 million records per month for our thermostats and growing rapidly. We only capture every 5 minutes of data rather than every minute. We have over 1 billion records overall. Partitioning (either manually or with MySQL's built-in partitioning) breaks the months into their own files so that any given search only has to go through the given month's data rather than the entire DB. So, I'm not sure why you would think that partitioning isn't scalable. The whole point of partitioning is scalability.

The only other idea I've tossed around is a NoSQL file per sensor per month, which might be the ultimate in speed, but I don't know enough about NoSQL yet to know all of the ins and outs.

But in any case for MySQL, using the 70 million records I mentioned on an InnoDB table with the primary key being (macAddress, timestamp), grabbing 2 days' worth of entries (576 records) takes 0.140 seconds. My local machine, which is a much slower machine, takes only 0.187 seconds for the same query. As I mentioned, because the primary key is a clustered index, it is WITH the data, so the data is actually ordered by mac, timestamp. So, when it finds the index, it finds the data. With a standard MySQL index, the server has to find the index entry, which points to the data, and then go fetch the data separately, which increases the time.

If you are using MySQL Workbench, I believe this is the difference between Duration / Fetch. If you see a high duration, then it's not able to find the data quickly. If you see a low duration and a high fetch time, then (I think, but am not totally certain) it's finding the data's index quickly, but fetching the rows takes time as it chases down all of those pointer locations. When I search on the clustered index, my fetch time is 0.031 seconds.

Regardless of whether you use a Clustered Index as suggested, in the end, you need to do EXPLAIN SELECT... on your query and make sure that it's actually using the indexes you expect. If not, you need to find out why. At the least, if you don't have it, I would create the index:

INDEX bySensorAndTime (sensor_id, created)

This way, MySQL only needs to use one index for your query, since (I'm guessing) you would always search with both of those fields in the WHERE clause.
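
Concretely, that index plus a plan check might look like this (the sensor id and date range are placeholders):

ALTER TABLE data ADD INDEX bySensorAndTime (sensor_id, created);

EXPLAIN
SELECT data_id, created
  FROM data
 WHERE sensor_id = 42
   AND created BETWEEN '2014-01-01' AND '2014-01-08';

-- the key column in the EXPLAIN output should show bySensorAndTime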
