
Improve performance in a big MySQL table

I'd like to ask a question about how to improve performance in a big MySQL table using the InnoDB engine:

There's currently a table in my database with around 200 million rows. This table periodically stores the data collected by different sensors. The structure of the table is as follows:

CREATE TABLE sns_value (
    value_id int(11) NOT NULL AUTO_INCREMENT,
    sensor_id int(11) NOT NULL,
    type_id int(11) NOT NULL,
    date timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
    value int(11) NOT NULL,
    PRIMARY KEY (value_id),
    KEY idx_sensor_id (sensor_id),
    KEY idx_date (date),
    KEY idx_type_id (type_id) );

At first, I thought of partitioning the table by month, but due to the steady addition of new sensors it would reach the current size in about a month.

Another solution I came up with was partitioning the table by sensor. However, due to MySQL's limit of 1024 partitions, that wasn't an option.

I believe that the right solution would be using a table with the same structure for each of the sensors:

sns_value_XXXXX

This way there would be more than 1,000 tables with an estimated size of 30 million rows per year. These tables could, at the same time, be partitioned by month for faster access to the data.

What problems would result from this solution? Is there a more normalized solution?

Editing with additional information

I consider the table to be big in relation to my server:

  • Cloud 2xCPU and 8GB Memory
  • LAMP (CentOS 6.5 and MySQL 5.1.73)

Each sensor may have more than one variable type (CO, CO2, etc.).

I mainly have two slow queries:

1) Daily summary for each sensor and type (avg, max, min):

SELECT round(avg(value)) as mean, min(value) as min, max(value) as max, type_id
FROM sns_value
WHERE sensor_id=1 AND date BETWEEN '2014-10-29 00:00:00' AND '2014-10-29 12:00:00'
GROUP BY type_id limit 2000;

This takes more than 5 min.

2) Vertical to Horizontal view and export:

SELECT sns_value.date AS date,
sum((sns_value.value * (1 - abs(sign((sns_value.type_id - 101)))))) AS one,
sum((sns_value.value * (1 - abs(sign((sns_value.type_id - 141)))))) AS two,
sum((sns_value.value * (1 - abs(sign((sns_value.type_id - 151)))))) AS three
FROM sns_value
WHERE sns_value.sensor_id=1 AND sns_value.date BETWEEN '2014-10-28 12:28:29' AND '2014-10-29 12:28:29'
GROUP BY sns_value.sensor_id, sns_value.date LIMIT 4500;

This also takes more than 5 min.
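
For context, the (1 - abs(sign(type_id - X))) factor is just an indicator: it evaluates to 1 when type_id equals X and to 0 otherwise, so each sum() picks out the values of a single type. An equivalent way of writing the same pivot (a sketch of the same query, not a new one) would be:

SELECT sns_value.date AS date,
       sum(if(sns_value.type_id = 101, sns_value.value, 0)) AS one,
       sum(if(sns_value.type_id = 141, sns_value.value, 0)) AS two,
       sum(if(sns_value.type_id = 151, sns_value.value, 0)) AS three
FROM sns_value
WHERE sns_value.sensor_id=1
  AND sns_value.date BETWEEN '2014-10-28 12:28:29' AND '2014-10-29 12:28:29'
GROUP BY sns_value.sensor_id, sns_value.date
LIMIT 4500;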

Other considerations

  1. Timestamps may be repeated due to the characteristics of the inserts.
  2. Periodic inserts must coexist with selects.
  3. No updates nor deletes are performed on the table.

Suppositions made for the "one table for each sensor" approach

  1. Tables for each sensor would be much smaller so access would be faster.
  2. Selects will be performed only on one table for each sensor.
  3. Selects mixing data from different sensors are not time-critical.

Update 02/02/2015

We have created a new table for each year of data, which we have also partitioned on a daily basis. Each table has around 250 million rows with 365 partitions. The new index used is the one Ollie suggested (sensor_id, date, type_id, value), but the query still takes between 30 seconds and 2 minutes. We do not use the first query (daily summary), just the second (vertical to horizontal view).

In order to be able to partition the table, the primary key had to be removed.

Are we missing something? Is there a way to improve the performance?

Many thanks!

Edited based on changes to the question

One table per sensor is, with respect, a very bad idea indeed. There are several reasons for that:

  1. MySQL servers on ordinary operating systems have a hard time with thousands of tables. Most OSs can't handle that many simultaneous file accesses.
  2. You'll have to create tables each time you add (or delete) sensors.
  3. Queries that involve data from multiple sensors will be slow and convoluted.

My previous version of this answer suggested range partitioning by timestamp. But that won't work with your value_id primary key. However, with the queries you've shown and proper indexing of your table, partitioning probably won't be necessary.
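
(For reference: MySQL requires the partitioning column to appear in every unique key of the table, the primary key included. So range partitioning by timestamp would only become possible after widening the primary key, roughly along these lines. This is a sketch assuming the ts column name suggested below and MySQL 5.1.43 or later, which accepts UNIX_TIMESTAMP() as a partitioning function:)

ALTER TABLE sns_value
  DROP PRIMARY KEY,
  ADD PRIMARY KEY (value_id, ts);

ALTER TABLE sns_value
PARTITION BY RANGE (UNIX_TIMESTAMP(ts)) (
    PARTITION p2014 VALUES LESS THAN (UNIX_TIMESTAMP('2015-01-01 00:00:00')),
    PARTITION p2015 VALUES LESS THAN (UNIX_TIMESTAMP('2016-01-01 00:00:00')),
    PARTITION pmax  VALUES LESS THAN MAXVALUE
);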

(Avoid the column name date if you can: it's a reserved word and you'll have lots of trouble writing queries. Instead I suggest you use ts, meaning timestamp.)

Beware: int(11) values aren't big enough for your value_id column. You're going to run out of ids. Use bigint(20) for that column.

You've mentioned two queries. Both these queries can be made quite efficient with appropriate compound indexes, even if you keep all your values in a single table. Here's the first one.

SELECT round(avg(value)) as mean, min(value) as min, max(value) as max,
       type_id
  FROM sns_value
 WHERE sensor_id=1
  AND date BETWEEN '2014-10-29 00:00:00' AND '2014-10-29 12:00:00'
GROUP BY type_id limit 2000;

For this query, you're first looking up sensor_id using a constant, then you're looking up a range of date values, then you're aggregating by type_id . Finally you're extracting the value column. Therefore, a so-called compound covering index on (sensor_id, date, type_id, value) will be able to satisfy your query directly with an index scan. This should be very fast for you--certainly faster than 5 minutes even with a large table.

In your second query, a similar indexing strategy will work.

SELECT sns_value.date AS date,
       sum((sns_value.value * (1 - abs(sign((sns_value.type_id - 101)))))) AS one,
       sum((sns_value.value * (1 - abs(sign((sns_value.type_id - 141)))))) AS two,
       sum((sns_value.value * (1 - abs(sign((sns_value.type_id - 151)))))) AS three
  FROM sns_value
 WHERE sns_value.sensor_id=1
   AND sns_value.date BETWEEN '2014-10-28 12:28:29' AND '2014-10-29 12:28:29'
 GROUP BY sns_value.sensor_id,sns_value.date
 LIMIT 4500;

Again, you start with a constant value of sensor_id and then use a date range. You then extract both type_id and value. That means the same four-column index I mentioned should work for you.

CREATE TABLE sns_value (
    value_id  bigint(20) NOT NULL AUTO_INCREMENT,
    sensor_id int(11) NOT NULL,
    type_id   int(11) NOT NULL,
    ts        timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
    value     int(11) NOT NULL,
    PRIMARY KEY     (value_id),
    INDEX query_opt (sensor_id, ts, type_id, value)
);
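
To verify that the index actually covers a query, EXPLAIN should report "Using index" in the Extra column. A quick check (assuming the ts column rename):

EXPLAIN
SELECT round(avg(value)) AS mean, min(value) AS min, max(value) AS max, type_id
  FROM sns_value
 WHERE sensor_id=1
   AND ts BETWEEN '2014-10-29 00:00:00' AND '2014-10-29 12:00:00'
 GROUP BY type_id;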

Creating a separate table for a range of sensors would be an idea.

Do not use auto_increment for a primary key if you don't have to. The DB engine usually clusters the data by its primary key.

Use a composite key instead; depending on your use case, the sequence of the columns may differ.

EDIT: Also added the type into the PK. Considering the queries, I would do it like this. The choice of field names is intentional: they should be descriptive, and always keep reserved words in mind.

CREATE TABLE snsXX_readings (
    sensor_id int(11) NOT NULL,
    reading int(11) NOT NULL,
    reading_time timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
    type_id int(11) NOT NULL,

    PRIMARY KEY (reading_time, sensor_id, type_id),
    KEY idx_reading_time (reading_time),
    KEY idx_type_id (type_id)
);

Also, consider summarizing the readings or grouping them into a single field.
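
For example, a daily rollup could look like this (a sketch; the sns_daily_summary table and its column names are hypothetical), refreshed once per day from the raw readings:

CREATE TABLE sns_daily_summary (
    sensor_id   int(11) NOT NULL,
    type_id     int(11) NOT NULL,
    reading_day date    NOT NULL,
    mean_value  int(11) NOT NULL,
    min_value   int(11) NOT NULL,
    max_value   int(11) NOT NULL,
    PRIMARY KEY (sensor_id, type_id, reading_day)
);

INSERT INTO sns_daily_summary
SELECT sensor_id, type_id, DATE(reading_time),
       ROUND(AVG(reading)), MIN(reading), MAX(reading)
FROM snsXX_readings
WHERE reading_time >= '2014-10-29 00:00:00'
  AND reading_time <  '2014-10-30 00:00:00'
GROUP BY sensor_id, type_id, DATE(reading_time);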

You can try getting randomized summary data.

I have a similar table: MyISAM engine (smallest table size), 10 million records, and no index on the table, because an index proved useless (tested). Over the full date range of all the data, this query returns in about 10 seconds:

SELECT * FROM (
        SELECT sensor_id, value, date 
        FROM sns_value l 
        WHERE l.sensor_id= 123 AND 
        (l.date BETWEEN '2013-10-29 12:28:29' AND '2015-10-29 12:28:29') 
        ORDER BY RAND() LIMIT 2000 
    ) as tmp
    ORDER BY tmp.date;

In the first step, this query selects the rows between the two dates and takes 2,000 of them in random order; in the second step, it sorts that sample by date. Every run returns 2,000 results drawn from different rows.
