MySQL Large Table Sharding to Smaller Table based on Unique ID
We have a large MySQL table (device_data) with the following columns:
ID (int)
dt (timestamp)
serial_number (char(20))
data1 (double)
data2 (double)
... // other columns
The table receives around 10M rows every day.
We have sharded the table by the date of the timestamp (device_data_YYYYMMDD). However, we feel this is not effective because most of our queries (shown below) filter on serial_number and span many dates.
SELECT * FROM device_data WHERE serial_number = 'XXX' AND dt >= '2018-01-01' AND dt <= '2018-01-07';
Therefore, we think that sharding based on the serial number will be more effective. Basically, we will have:
device_data_<serial_number>
device_data_0012393746
device_data_7891238456
Hence, when we want to find data for a particular device, we can simply query:
SELECT * FROM device_data_<serial_number> WHERE dt >= '2018-01-01' AND dt <= '2018-01-07';
This approach seems to be effective because:
A few challenges that we think we will face are:
Update. Below is the SHOW CREATE TABLE result from our existing table:
CREATE TABLE `test_udp_new` (
`id` int(20) unsigned NOT NULL AUTO_INCREMENT,
`dt` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
`device_sn` varchar(20) NOT NULL,
`gps_date` datetime NOT NULL,
`lat` decimal(10,5) DEFAULT NULL,
`lng` decimal(10,5) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `device_sn_2` (`dt`,`device_sn`),
KEY `dt` (`dt`),
KEY `data` (`data`) USING BTREE,
KEY `test_udp_new_device_sn_dt_index` (`device_sn`,`dt`),
KEY `test_udp_new_device_sn_data_dt_index` (`device_sn`,`data`,`dt`)
) ENGINE=InnoDB AUTO_INCREMENT=44449751 DEFAULT CHARSET=latin1 ROW_FORMAT=DYNAMIC
The most frequent queries being run:
SELECT *
FROM test_udp_new
WHERE device_sn = 'xxx'
AND dt >= 'xxx'
AND dt <= 'xxx'
ORDER BY dt DESC;
The optimal way to handle that query is in a non-partitioned table with
INDEX(serial_number, dt)
Even better is to change the PRIMARY KEY. Assuming you currently have id AUTO_INCREMENT because there is not a unique combination of columns suitable for being a "natural PK", use
PRIMARY KEY(serial_number, dt, id), -- to optimize that query
INDEX(id) -- to keep AUTO_INCREMENT happy
If there are other queries that are run often, please provide them; this change may slow them down. In large tables, finding the optimal index(es) is a juggling act.
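One way to check whether the composite index is actually used is to run EXPLAIN on the frequent query. The sketch below uses the table and column names from the SHOW CREATE TABLE output in the question (test_udp_new, device_sn); the literal values are placeholders, and the half-open upper bound is an assumption about the intended date range, not something stated in the question.

```sql
-- Ask the optimizer how it would run the frequent query.
-- In the result, the "key" column names the index chosen and
-- "rows" gives the estimated number of rows examined.
EXPLAIN
SELECT *
FROM test_udp_new
WHERE device_sn = 'XXX'          -- placeholder serial number
  AND dt >= '2018-01-01'
  AND dt <  '2018-01-08'         -- assumed intent: all of Jan 7 included
ORDER BY dt DESC;
```

Note that dt <= '2018-01-07' compares against midnight at the start of January 7, so it would leave out almost all of that day. If the whole week is wanted, an exclusive upper bound on the next day is the usual form.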
Other Comments:
With serial_number first in the PRIMARY KEY, all queries referring to a single serial_number are likely to benefit.
A large number of serial_numbers? No problem.
DELETEs are much more costly than DROP PARTITION. That involves PARTITION BY RANGE(TO_DAYS(dt)). (If dt is a TIMESTAMP, MySQL only accepts UNIX_TIMESTAMP(dt) as the partitioning expression; TO_DAYS(dt) requires a DATE or DATETIME column.) If you are interested in that, my PK suggestion still stands.
Do you need DOUBLE? FLOAT has about 7 significant digits of precision and takes only 4 bytes.
Is serial_number fixed at 20 characters? If not, use VARCHAR. Also, CHARACTER SET ascii may be better than the default of utf8?
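To make the DROP PARTITION remark concrete, here is a minimal sketch of a daily-partitioned table that keeps the suggested primary key. The table name device_data_partitioned, the partition names, and the dates are hypothetical. Because dt is a TIMESTAMP here, the sketch partitions on UNIX_TIMESTAMP(dt) rather than TO_DAYS(dt).

```sql
-- Hypothetical table: one partition per day, PK led by the serial number.
CREATE TABLE device_data_partitioned (
    id        INT UNSIGNED NOT NULL AUTO_INCREMENT,
    dt        TIMESTAMP    NOT NULL DEFAULT CURRENT_TIMESTAMP,
    device_sn VARCHAR(20)  CHARACTER SET ascii NOT NULL,
    lat       DECIMAL(7,5) DEFAULT NULL,
    lng       DECIMAL(8,5) DEFAULT NULL,
    PRIMARY KEY (device_sn, dt, id),  -- includes dt, as partitioning requires
    INDEX (id)                        -- keeps AUTO_INCREMENT happy
) ENGINE=InnoDB
PARTITION BY RANGE (UNIX_TIMESTAMP(dt)) (
    PARTITION p20180101 VALUES LESS THAN (UNIX_TIMESTAMP('2018-01-02 00:00:00')),
    PARTITION p20180102 VALUES LESS THAN (UNIX_TIMESTAMP('2018-01-03 00:00:00')),
    PARTITION pfuture   VALUES LESS THAN MAXVALUE
);

-- Purging a whole day is a quick metadata operation,
-- unlike a DELETE that touches every row for that day.
ALTER TABLE device_data_partitioned DROP PARTITION p20180101;
```

Partitioning is only worth adding if old data really is purged by date; for the SELECT itself, the primary key does the work.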
Addressing the query
PRIMARY KEY (`id`),
KEY `device_sn_2` (`dt`,`device_sn`),
KEY `dt` (`dt`),
KEY `data` (`data`) USING BTREE,
KEY `test_udp_new_device_sn_dt_index` (`device_sn`,`dt`),
KEY `test_udp_new_device_sn_data_dt_index` (`device_sn`,`data`,`dt`)
-->
PRIMARY KEY(`device_sn`,`dt`, id),
INDEX(id),
KEY `dt_sn` (`dt`,`device_sn`),
KEY `data` (`data`) USING BTREE,
Notes:
By starting the PK with device_sn, dt, you get the clustering benefit for queries with WHERE device_sn = .. AND dt BETWEEN ...
INDEX(id) is to keep AUTO_INCREMENT happy.
When you have INDEX(a,b), INDEX(a) is redundant.
The (20) in int(20) is meaningless; id will max out at about 4 billion.
lng decimal(10,5) -- You don't need 5 digits to the left of the decimal point; you only need 3 (lng) or 2 (lat). So: lat decimal(7,5), lng decimal(8,5). This will save a total of 3 bytes per row.