
MySQL Large Table Sharding to Smaller Table based on Unique ID

We have a large MySQL table (device_data) with the following columns:

ID (int)
dt (timestamp)
serial_number (char(20))
data1 (double)
data2 (double)
... // other columns

The table receives around 10M rows every day.

We have sharded it by the date of the timestamp (device_data_YYYYMMDD). However, we feel this is not effective because most of our queries (shown below) filter on serial_number and span many dates.

SELECT * FROM device_data WHERE serial_number = 'XXX' AND dt >= '2018-01-01' AND dt <= '2018-01-07';

Therefore, we think that sharding by serial number will be more effective. Basically, we will have:

device_data_<serial_number>
device_data_0012393746
device_data_7891238456

Hence, when we want to find data for a particular device, we can simply query its table:

SELECT * FROM device_data_<serial_number> WHERE dt >= '2018-01-01' AND dt <= '2018-01-07';

This approach seems to be effective because:

  1. The application always accesses the data by device first.
  2. We have checked that no query accesses the data without specifying the device serial number first.
  3. The table for each device will be relatively small (9000 rows per day).

A few challenges that we think we will face are:

  1. We have a lot of devices, so there will be a lot of device_data_ tables too. I have checked that MySQL does not limit the number of tables in a database. Will this affect performance compared with keeping everything in one table?
  2. How will this affect us later when we want to scale MySQL (e.g. using master/slave replication)?
  3. Are there other alternatives or solutions for this?

Update. Below is the SHOW CREATE TABLE result from our existing table:

CREATE TABLE `test_udp_new` (
 `id` int(20) unsigned NOT NULL AUTO_INCREMENT,
 `dt` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
 `device_sn` varchar(20) NOT NULL,
 `gps_date` datetime NOT NULL,
 `lat` decimal(10,5) DEFAULT NULL,
 `lng` decimal(10,5) DEFAULT NULL,
 PRIMARY KEY (`id`),
 KEY `device_sn_2` (`dt`,`device_sn`),
 KEY `dt` (`dt`),
 KEY `data` (`data`) USING BTREE,
 KEY `test_udp_new_device_sn_dt_index` (`device_sn`,`dt`),
 KEY `test_udp_new_device_sn_data_dt_index` (`device_sn`,`data`,`dt`)
) ENGINE=InnoDB AUTO_INCREMENT=44449751 DEFAULT CHARSET=latin1 ROW_FORMAT=DYNAMIC

The most frequent query being run:

SELECT  *
    FROM  test_udp_new
    WHERE  device_sn = 'xxx'
      AND  dt >= 'xxx'
      AND  dt <= 'xxx'
    ORDER BY  dt DESC;

The optimal way to handle that query is in a non-partitioned table with

INDEX(serial_number, dt)
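As a rough sketch of what that looks like in practice, assuming the table and column names from the question (device_data, serial_number, dt). The index name sn_dt is made up for illustration. Running EXPLAIN on the query afterwards is a simple way to check which index the optimizer picks.

```sql
-- Hypothetical index name; table and columns are taken from the question.
ALTER TABLE device_data
    ADD INDEX sn_dt (serial_number, dt);

-- Check the plan for the typical query.
-- Note: dt <= '2018-01-07' stops at midnight at the start of Jan 7.
-- If the whole day is wanted, a half-open range is less error-prone.
EXPLAIN
SELECT *
FROM device_data
WHERE serial_number = 'XXX'
  AND dt >= '2018-01-01'
  AND dt <  '2018-01-08';
```

With the equality column first and the range column second, the lookup can go straight to one device's rows and scan only the requested time range within them.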

Even better is to change the PRIMARY KEY. Assuming you currently have id AUTO_INCREMENT because no unique combination of columns is suitable as a "natural PK", use:

PRIMARY KEY(serial_number, dt, id),  -- to optimize that query
INDEX(id)  -- to keep AUTO_INCREMENT happy

If there are other queries that are run often, please provide them; this change may hurt them. In large tables, it is a juggling task to find the optimal index(es).

Other comments:

  • There are very few use cases for which partitioning actually speeds up processing.
  • Making lots of 'identical' tables is a maintenance nightmare, and, again, not a performance benefit. There are probably a hundred Q&As on Stack Overflow shouting not to do this.
  • By having serial_number first in the PRIMARY KEY, all queries referring to a single serial_number are likely to benefit.
  • A million serial_numbers? No problem.
  • One common use case for partitioning involves purging "old" data. This is because big DELETEs are much more costly than DROP PARTITION. That involves PARTITION BY RANGE(TO_DAYS(dt)). If you are interested in that, my PK suggestion still stands. (And the query in question will run at about the same speed with or without this partitioning.)
  • How many months before the table outgrows your disk? (If this will be an issue, let's discuss it.)
  • Do you need 8-byte DOUBLE? FLOAT has about 7 significant digits of precision and takes only 4 bytes.
  • You are using InnoDB?
  • Is serial_number fixed at 20 characters? If not, use VARCHAR. Also, CHARACTER SET ascii may be better than the default of utf8?
  • Each table (or each partition of a table) involves at least one file that the OS must deal with. When you have "too many", the OS groans, often before MySQL groans. (It is hard to make either "die" of overdose.)
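If purging old data is the goal, a daily-partitioned layout could look roughly like the sketch below. The table name device_data_sketch and the partition names are made up for illustration, and I am assuming dt can be a DATETIME: as far as I know, MySQL rejects TO_DAYS() on a TIMESTAMP column in a partitioning expression, and UNIX_TIMESTAMP(dt) is the permitted form there. The FLOAT and ascii choices just reflect the bullets above and may not fit your data.

```sql
-- Sketch only: hypothetical table and partition names, dt assumed to be DATETIME.
CREATE TABLE device_data_sketch (
    id            INT UNSIGNED NOT NULL AUTO_INCREMENT,
    dt            DATETIME     NOT NULL,
    serial_number VARCHAR(20)  CHARACTER SET ascii NOT NULL,
    data1         FLOAT,
    data2         FLOAT,
    -- Every unique key must contain the partitioning column (dt);
    -- this PK does, so it is compatible with the partitioning below.
    PRIMARY KEY (serial_number, dt, id),
    INDEX (id)   -- keeps AUTO_INCREMENT happy
) ENGINE=InnoDB
PARTITION BY RANGE (TO_DAYS(dt)) (
    PARTITION p20180101 VALUES LESS THAN (TO_DAYS('2018-01-02')),
    PARTITION p20180102 VALUES LESS THAN (TO_DAYS('2018-01-03')),
    PARTITION pfuture   VALUES LESS THAN MAXVALUE
);

-- Purge one day of data: far cheaper than a large DELETE.
ALTER TABLE device_data_sketch DROP PARTITION p20180101;

-- Make room for the next day by splitting the catch-all partition.
ALTER TABLE device_data_sketch REORGANIZE PARTITION pfuture INTO (
    PARTITION p20180103 VALUES LESS THAN (TO_DAYS('2018-01-04')),
    PARTITION pfuture   VALUES LESS THAN MAXVALUE
);
```

The two ALTER statements would normally be run by a scheduled job. The query by serial_number and dt range is served by the PRIMARY KEY either way, which is why the partitioning is about purging rather than speed.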

Addressing the query

 PRIMARY KEY (`id`),
 KEY `device_sn_2` (`dt`,`device_sn`),
 KEY `dt` (`dt`),
 KEY `data` (`data`) USING BTREE,
 KEY `test_udp_new_device_sn_dt_index` (`device_sn`,`dt`),
 KEY `test_udp_new_device_sn_data_dt_index` (`device_sn`,`data`,`dt`)

-->

 PRIMARY KEY(`device_sn`,`dt`, id),
 INDEX(id),
 KEY `dt_sn` (`dt`,`device_sn`),
 KEY `data` (`data`) USING BTREE
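Assuming the table is exactly as in the SHOW CREATE TABLE output, the index changes could be applied in a single statement, roughly like this. Changing the PRIMARY KEY of an InnoDB table rebuilds the whole table, so on tens of millions of rows expect it to take a while; I would try it on a copy first.

```sql
-- Sketch: index names are copied from the SHOW CREATE TABLE output;
-- dt_sn is the new name used in the index list above.
ALTER TABLE test_udp_new
    DROP PRIMARY KEY,
    DROP INDEX `device_sn_2`,                           -- replaced by dt_sn
    DROP INDEX `dt`,                                    -- redundant with (dt, device_sn)
    DROP INDEX `test_udp_new_device_sn_dt_index`,       -- covered by the new PK
    DROP INDEX `test_udp_new_device_sn_data_dt_index`,  -- probably covered well enough by the new PK
    ADD PRIMARY KEY (`device_sn`, `dt`, `id`),
    ADD INDEX (`id`),                                   -- keeps AUTO_INCREMENT happy
    ADD INDEX `dt_sn` (`dt`, `device_sn`);
```

Doing everything in one ALTER means the table is rebuilt once rather than once per change, and id is never left without an index.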

Notes:

  • By starting the PK with device_sn, dt, you get the clustering benefits that make the query with WHERE device_sn = .. AND dt BETWEEN ... efficient.
  • INDEX(id) is to keep AUTO_INCREMENT happy.
  • When you have INDEX(a,b), INDEX(a) is redundant.
  • The (20) is meaningless; id will max out at about 4 billion.
  • I tossed the last index because the queries it served are probably helped enough by the new PK.
  • lng decimal(10,5) -- You don't need 5 digits to the left of the decimal point; you only need 3 or 2. So: lat decimal(7,5), lng decimal(8,5). This will save a total of 3 bytes per row.
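The column changes from the last notes could be sketched as follows, again assuming the table from the SHOW CREATE TABLE output. Checking the stored ranges first is my own precaution, not something the table definition guarantees: shrinking a DECIMAL will fail or clip if any existing value does not fit.

```sql
-- Sanity check before shrinking: latitudes should fit in +/-99.99999
-- and longitudes in +/-999.99999.
SELECT MIN(lat), MAX(lat), MIN(lng), MAX(lng)
FROM test_udp_new;

-- DECIMAL(10,5) takes 6 bytes; DECIMAL(7,5) takes 4 and DECIMAL(8,5) takes 5,
-- which is where the 3 bytes per row come from.
ALTER TABLE test_udp_new
    MODIFY `id`  INT UNSIGNED NOT NULL AUTO_INCREMENT,  -- drops the meaningless (20)
    MODIFY `lat` DECIMAL(7,5) DEFAULT NULL,
    MODIFY `lng` DECIMAL(8,5) DEFAULT NULL;
```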
