
Performance: Using indexing and partitioning (PostgreSQL)

I have a fairly simple database model. My table "main" looks like this:

| id (PK) | device_id (int) | msg_type (int) | rawdata (text) | timestamp (date+time) |

Each received message is stored in this table, including the message type, the timestamp, the device which sent it and the rawdata.

In addition, for each possible msg_type (approx. 30 in total) I have a separate table storing the parsed raw data. Example for the table "main_type1":

| id (PK) | main_id (FK) | device_id (int) | attribute_1 | attribute_2 | attribute_n |

(The structure differs for each msg_type, and the messages are not equally distributed, meaning some tables are huge and some are small.)

Please note that the device_id is always included within the rawdata, so each table has this column.
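For reference, the two tables described above could be declared roughly as follows. This is only a sketch: the question gives column names but no types, so every type and constraint here (bigserial, timestamptz, text attributes, NOT NULL) is an assumption.

```sql
-- Hypothetical DDL reconstructed from the column lists in the question.
-- All column types and constraints are assumptions.
CREATE TABLE main (
    id        bigserial   PRIMARY KEY,
    device_id integer     NOT NULL,
    msg_type  integer     NOT NULL,
    rawdata   text,
    timestamp timestamptz NOT NULL
);

CREATE TABLE main_type1 (
    id          bigserial PRIMARY KEY,
    main_id     bigint    NOT NULL REFERENCES main (id),  -- FK to main
    device_id   integer   NOT NULL,                       -- duplicated from main
    attribute_1 text,
    attribute_2 text
);
```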

Now to my problem:

I used to have queries such as:

select attribute_1, attribute_2 from main_type1 inner join main on main_type1.main_id = main.id where timestamp > X and timestamp < Y and main.device_id = Z

At the beginning everything worked well and was fast. But now my database has more than 400,000,000 entries in "main". Queries are taking up to 15 minutes now.

Indexing

I tried to use indexing such as:

CREATE INDEX device_id_index ON main (device_id);

Well, now I can retrieve data much faster from the main table, but it does not help with joins. My biggest problem here is that I stored the timestamp information only in the main table. So I have to join all the time... is this a general failure of my database model? I tried to avoid storing timestamps twice.

Partitioning

Would one solution be to create a new table with rawdata for each device_id by using partitioning? I would then (of course automatically) create appropriate partitions such as:

main_device_id_343223
main_device_id_4563
main_device_id_92338
main_device_id_4142315

Would this give me speed advantages for the joins? What other options do I have? For the sake of completeness: I am using PostgreSQL.

Since your problem is the execution time of a join, the first thing to do is to try to speed up the query by creating indexes in the following way:

  1. Indexes that help the join itself, in this case an index on the foreign key column main_id in main_type1, which references main.id (note that a foreign key declaration does not automatically create an index):

     CREATE INDEX main_type_main_id_index ON main_type1(main_id); 
  2. Indexes that help restrict the set of rows considered by the query, in this case on the timestamp attribute:

     CREATE INDEX main_timestamp_index ON main(timestamp); 
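As a complement to the two indexes above, and not part of the original answer, a composite index can be worth trying when queries filter on a single device_id and a timestamp range at the same time, as the query in the question does. The index name and the literal values below are placeholders; EXPLAIN (ANALYZE, BUFFERS) shows whether the planner actually uses the new indexes.

```sql
-- Assumption: most queries filter on one device_id plus a timestamp range.
-- The equality column goes first, the range column second.
CREATE INDEX main_device_timestamp_index ON main (device_id, timestamp);

-- Inspect the real execution plan. Note that ANALYZE runs the query.
-- The device id and the dates are placeholder values.
EXPLAIN (ANALYZE, BUFFERS)
SELECT t.attribute_1, t.attribute_2
FROM main_type1 AS t
JOIN main AS m ON t.main_id = m.id
WHERE m.device_id = 4563
  AND m.timestamp > '2024-01-01'
  AND m.timestamp < '2024-02-01';
```

If the plan still shows a sequential scan on main or main_type1, the indexes are not being used for this query and the other suggestions in this thread become more relevant.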

You can also consider creating a partial index on the timestamp attribute if your queries only look at a specific subset of the values.
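A partial index of that kind might look like the sketch below. The cutoff date and the index name are made up for illustration; the idea only pays off if most queries really target the rows covered by the predicate.

```sql
-- Hypothetical cutoff: only rows from 2024 onwards are indexed,
-- which keeps the index much smaller than one over all rows.
CREATE INDEX main_recent_timestamp_index
    ON main (timestamp)
    WHERE timestamp >= '2024-01-01';
```

The planner can only use this index for queries whose WHERE clause implies the index predicate, for example a range that starts on or after the cutoff date.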

If these indexes do not speed up the query significantly, then you should follow the answer of @klin.

I would suggest the following approach: first, create the indexes proposed by Renzo. If that does not improve performance enough, try using partitions.

From the documentation:

Partitioning can provide several benefits: Query performance can be improved dramatically in certain situations, particularly when most of the heavily accessed rows of the table are in a single partition or a small number of partitions. The partitioning substitutes for leading columns of indexes, reducing index size and making it more likely that the heavily-used parts of the indexes fit in memory. (...)

If you use partitioning, all queries that reference a specific device (such as the one in your question) should be much faster. Only queries that span many device_id values (e.g. those containing aggregates) may be slower.
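A minimal sketch of what this could look like with declarative list partitioning follows. It assumes PostgreSQL 11 or later; the parent table name main_partitioned and the column types are assumptions, while the partition names reuse the naming scheme from the question.

```sql
-- Sketch only: assumes PostgreSQL 11+ and assumed column types.
CREATE TABLE main_partitioned (
    id        bigint      NOT NULL,
    device_id integer     NOT NULL,
    msg_type  integer     NOT NULL,
    rawdata   text,
    timestamp timestamptz NOT NULL,
    PRIMARY KEY (device_id, id)  -- the key must include the partition column
) PARTITION BY LIST (device_id);

-- One partition per device, named as in the question.
CREATE TABLE main_device_id_4563  PARTITION OF main_partitioned FOR VALUES IN (4563);
CREATE TABLE main_device_id_92338 PARTITION OF main_partitioned FOR VALUES IN (92338);

-- Catch-all for devices that do not have their own partition yet.
CREATE TABLE main_device_id_default PARTITION OF main_partitioned DEFAULT;

-- Declared on the parent, so each partition gets its own index.
CREATE INDEX ON main_partitioned (timestamp);
```

With this layout a condition such as device_id = 4563 lets the planner skip every other partition. A very large number of partitions can increase planning time, so if there are many devices, hash partitioning on device_id with a fixed number of partitions may be the more practical variant.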

