Optimizing DB performance with multiple indices for one table

I have timeseries data about a number of items that I store (in this toy example) in a simple pair of tables. For now, this is done in MySQL, but if sufficiently strong reasons exist for trying to solve my problem in a different DBMS, I'd be all ears!

The ITEM table has a primary key and a single text-like column that can be thought of as a description; let's call it descr . The DATAPOINT table has a primary key and 3 other columns: a foreign key into the ITEM table (call it fk_item ), a datetime I'll call timestamp , and a float we'll call value . Further, there is a joint uniqueness constraint on the (fk_item, timestamp) column pair (we only want one value in the DB for a given item at a given time).
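For concreteness, here is a minimal sketch of the schema as described (the table and column names come from the question; the surrogate keys, datatypes, and storage engine are assumptions):

    CREATE TABLE ITEM (
        id    INT UNSIGNED NOT NULL AUTO_INCREMENT,
        descr TEXT NOT NULL,                    -- the description column
        PRIMARY KEY (id)
    ) ENGINE=InnoDB;

    CREATE TABLE DATAPOINT (
        id          BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
        fk_item     INT UNSIGNED NOT NULL,
        `timestamp` DATETIME NOT NULL,
        value       FLOAT NOT NULL,
        PRIMARY KEY (id),
        UNIQUE KEY uq_item_time (fk_item, `timestamp`),  -- the joint uniqueness constraint
        FOREIGN KEY (fk_item) REFERENCES ITEM (id)
    ) ENGINE=InnoDB;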

To put real numbers on it, the DATAPOINT table has about 1bn rows, which is the result of having approximately 100k rows for each of 10k distinct items.

My question is about the ability to optimize both read and write performance in this context, and the best way to enforce that uniqueness constraint.

A typical read from this DB will involve a small number of items (half a dozen?) for which we want to get all values in a given datetime range (containing approximately 1k points per item). To that end, it would be very handy to have an index which is (fk_item, timestamp) and to enforce the joint uniqueness criterion on this index. The motivation behind reads of this type is: "I want to make a line graph of 2 or 3 items for this 3 year range".
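A read of that shape might look like the following (the item ids and date range are placeholders; with an index on (fk_item, timestamp), each item's rows are scanned as one contiguous range):

    SELECT fk_item, `timestamp`, value
    FROM   DATAPOINT
    WHERE  fk_item IN (17, 42, 99)        -- the 2 or 3 items to graph (hypothetical ids)
      AND  `timestamp` >= '2015-01-01'
      AND  `timestamp` <  '2018-01-01'    -- the 3-year range
    ORDER BY fk_item, `timestamp`;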

However, a typical write for this database would look very different. It would be an insertion of a single data point for each of several thousand items, all with the same (or a small number of) timestamps. The motivation for this kind of write can be stated intuitively as: "I want to add yesterday's datapoint for every single item". So for writes of that sort, it would be more practical to have an index which is (timestamp, fk_item) , and to enforce the uniqueness restriction on that index.
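Sketched as a multi-row INSERT, that write pattern looks something like this (the values are placeholders):

    INSERT INTO DATAPOINT (fk_item, `timestamp`, value)
    VALUES (1, '2018-06-01 00:00:00', 3.14),
           (2, '2018-06-01 00:00:00', 2.72),
           (3, '2018-06-01 00:00:00', 1.62);  -- ...one row per item, thousands in total, same timestamp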

Importantly, for the scale of my data and hardware, neither of these indices can fit entirely into RAM.

Typically, the vast majority of the writes happen in just a short window each day: i.e., at the end of each day, all the data for that day gets written in a 15-minute period, and then reads occur throughout the day (but generally not during that 15-minute period).

So, from what I understand, if I build the table with the read-optimized (fk_item, timestamp) index (and put the uniqueness constraint there), then my typical reads will be nice and speedy. But I'm concerned that my writes will be slow because we will need to update the index in a non-contiguous way. However, if I build the table with the write-optimized (timestamp, fk_item) index (and put the uniqueness constraint there), then my typical writes will be speedy but my typical reads will suffer.

Is there any way to get the best of both worlds? For example, if I build two indices: (fk_item, timestamp) and (timestamp, fk_item) and place the uniqueness only on the latter of the two, will that work well? Or will writes still proceed at the "slow" speed because even though there is a write-optimized index (to check the uniqueness constraint, for example), the read-optimized index will need to be updated on any inserts, and that update will be non-contiguous?
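Concretely, the two-index variant being asked about would be something like this (the index names are made up):

    ALTER TABLE DATAPOINT
        ADD UNIQUE KEY uq_time_item (`timestamp`, fk_item),  -- write-ordered; carries the uniqueness check
        ADD        KEY ix_item_time (fk_item, `timestamp`);  -- read-ordered; non-unique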

Thanks in advance!

Short answer: (fk_item, timestamp) only.

Long answer:

As far as uniqueness goes, (fk_item, timestamp) and (timestamp, fk_item) are the same. While they both declare uniqueness equally well, they both suck at being unique. Someday, a particular item will show up twice in the same second.

You did mention "yesterday". So, if the entry is really a subtotal for the day, then (fk_item, date) is reasonable.

When building an index, it is always better to have the date/time item last. This is so that WHERE fk_item = 123 AND date BETWEEN ... AND ... can use that index. Writes don't care (much) what order things are in.

What about the PRIMARY KEY ? It is, by MySQL's definition, UNIQUE and an INDEX . So, if (fk_item, date) is reasonable, make it the PK. This will make queries that need to look at several rows for a specific item more efficient.
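Under that suggestion, DATAPOINT drops the surrogate id and uses the natural key directly (a sketch; the datatypes are assumptions). Since InnoDB clusters the data on the PK, all the rows for one item end up physically adjacent, which is what makes those multi-row reads cheap:

    CREATE TABLE DATAPOINT (
        fk_item INT UNSIGNED NOT NULL,
        `date`  DATE  NOT NULL,
        value   FLOAT NOT NULL,
        PRIMARY KEY (fk_item, `date`)  -- clustered: one item's rows are stored together
    ) ENGINE=InnoDB;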

"I want to make a line graph of 2 or 3 items for this 3 year range". “我希望在这3年的范围内制作2或3个项目的折线图”。 -- If that involves millions of rows, then you have designed the schema inefficiently. - 如果涉及数百万行,那么您已经无效地设计了模式。 You need to build and maintain a Summary table of, say, daily values for each item. 您需要构建和维护每个项目的每日值的摘要表。 Then it would be hundreds, not millions, of rows -- much more viable. 然后它将是数百,而不是数百万行 - 更可行。

Back to the INSERTs . With 10k distinct items and PRIMARY KEY(fk_item, date) , there would be 10K spots in the table where the insert occurs. This is actually OK, and will be roughly the same speed as some other ordering.

The daily INSERTs are best done with either LOAD DATA INFILE or with multi-row INSERTs .
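For example (the file path and column layout are placeholders):

    LOAD DATA LOCAL INFILE '/tmp/datapoints_2018-06-01.csv'
    INTO TABLE DATAPOINT
    FIELDS TERMINATED BY ','
    LINES TERMINATED BY '\n'
    (fk_item, `timestamp`, value);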

I am speaking from a MySQL perspective. Some, though perhaps not all, of what I say applies to other products.

PARTITIONing is a useless idea for MySQL unless you intend to purge 'old' data. (I can't speak for Postgres.)
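For completeness, the purge-oriented use would look roughly like this; note that MySQL requires the partitioning column to appear in every unique key, which the suggested PRIMARY KEY(fk_item, date) satisfies (the yearly ranges are an assumption):

    ALTER TABLE DATAPOINT
    PARTITION BY RANGE (TO_DAYS(`date`)) (
        PARTITION p2016 VALUES LESS THAN (TO_DAYS('2017-01-01')),
        PARTITION p2017 VALUES LESS THAN (TO_DAYS('2018-01-01')),
        PARTITION p2018 VALUES LESS THAN (TO_DAYS('2019-01-01')),
        PARTITION pmax  VALUES LESS THAN MAXVALUE
    );

    -- Purging a year then becomes a cheap metadata operation:
    ALTER TABLE DATAPOINT DROP PARTITION p2016;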

If you insert rows randomly, you may run into unrealistic performance problems. This is because your real situation will be much less "random". There will be only 10K spots where you do INSERTs today, not 1 billion. And tomorrow, it will be the 'same' 10K spots.

"how a table like this should be constructed" -- Minimize datatypes (eg, don't use an 8-byte BIGINT for a yes/no flag); “如何构造这样的表” - 最小化数据类型(例如,不要使用8字节BIGINT作为是/否标志); Provide the optimal PK (I suggested (item, day) ). 提供最佳PK(我建议(item, day) )。 But you must have tentative SELECTs in order to settle on the secondary indexes. 但是你必须有暂定的SELECTs才能确定二级索引。 Normalize where appropriate ( item_id ), but don't over-normalize (dates). 适当时标准化( item_id ),但不要过度标准化(日期)。
