How to best create an index for a large table with no primary key?

First off, I am not a database programmer.

I have built the following table for stock market tick data:

CREATE TABLE [dbo].[Tick]
(
    [trade_date] [int] NOT NULL,
    [delimiter] [tinyint] NOT NULL,
    [time_stamp] [int] NOT NULL,
    [exchange] [tinyint] NOT NULL,
    [symbol] [varchar](10) NOT NULL,
    [price_field] [tinyint] NOT NULL,
    [price] [int] NOT NULL,
    [size_field] [tinyint] NOT NULL,
    [size] [int] NOT NULL,
    [exchange2] [tinyint] NOT NULL,
    [trade_condition] [tinyint] NOT NULL
) ON [PRIMARY]
GO

The table will store 6 years of data to begin with. At an average of 300 million ticks per day, that would be about 450 billion rows.
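
A quick sanity check of that figure, assuming roughly 250 trading days per year (an assumption, not stated in the question; the exact count varies by exchange and year):

```sql
-- 300 million ticks/day * ~250 trading days/year (assumed) * 6 years.
-- CAST to bigint because the product overflows the int range.
SELECT CAST(300000000 AS bigint) * 250 * 6 AS estimated_rows;  -- 450000000000
```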

A common query on this table gets all the ticks for one or more symbols over a date range:

SELECT 
    trade_date, time_stamp, symbol, price, size 
FROM 
    [dbo].[Tick]
WHERE 
    trade_date > 20160101 AND trade_date < 20170101
    AND symbol = 'AAPL' 
    AND price_field = 0
ORDER BY 
    trade_date, time_stamp

This is my first attempt at an index:

CREATE UNIQUE CLUSTERED INDEX [ClusteredIndex-20180324-183113] 
ON [dbo].[Tick]
(
    [trade_date] ASC,
    [symbol] ASC,
    [time_stamp] ASC,
    [price_field] ASC,
    [delimiter] ASC,
    [exchange] ASC,
    [price] ASC,
    [size_field] ASC,
    [size] ASC,
    [exchange2] ASC,
    [trade_condition] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
GO

First, I put date before symbol because there are fewer days than symbols, so the shorter path is to get to the date first.

I have included all the columns I would potentially need to retrieve. When I tested building it on one day's worth of data, the index was quite large: about 4 GB for a 20 GB table.

Two questions:

  • Is leaving out a primary key to save space a wise choice, assuming my query requirements don't change?

  • Would I save space if I only include trade_date and symbol in the index? How would that affect performance? I've been told I need to include all the columns I need in the index, otherwise retrieval would be very slow because it would have to go back to the primary key to find the values of columns not included in the index. If this is true, how would that even work when my table doesn't have a primary key?

Your unique clustered index should contain the minimum number of columns necessary to uniquely identify a row in your table. If that means almost every column in your table, I would think you should add an artificial primary key. Cutting an artificial primary key to save space is a poor decision IMO; only cut it if you can create a natural primary key out of your data.
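
As a rough sketch of what adding an artificial key could look like (the column name tick_id and the constraint name PK_Tick are made up for illustration, and this assumes the table does not already have a clustered index):

```sql
-- Hypothetical surrogate key; tick_id and PK_Tick are illustrative names.
-- bigint is used because ~450 billion rows exceeds the int range (about 2.1 billion).
ALTER TABLE [dbo].[Tick]
ADD [tick_id] [bigint] IDENTITY(1,1) NOT NULL
    CONSTRAINT [PK_Tick] PRIMARY KEY CLUSTERED;
```

On a table this size, it is probably cheaper to define the column when the table is created than to add it after the data has been loaded.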

The clustered index is essentially where all your data is stored. The leaf nodes of the index contain all the data for each row; the columns that make up the index key determine how to reach those leaf nodes.

Including extra columns in your index to speed up queries only applies to NONCLUSTERED indexes, as there the leaf node generally only contains a lookup value. For these indexes, the way to include extra columns is to use the INCLUDE clause, not just list them all as part of the index key. For example:

CREATE NONCLUSTERED INDEX [IX_TickSummary] ON [dbo].[Tick]
(
    [trade_date] ASC,
    [symbol] ASC
)
INCLUDE (
    [time_stamp],
    [price],
    [size],
    [price_field]
)

This is a concept known as a covering index: the index itself contains all the columns needed to process your query, so no additional lookup into the data table is needed. The upside is increased speed. The downside is that the INCLUDEd columns are essentially duplicated, resulting in a larger index that takes up more space.
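
To see how much space each index actually takes, a query along these lines should work on SQL Server. It reads the sys.dm_db_partition_stats view and assumes the table is dbo.Tick as in the question:

```sql
-- Approximate space used per index on dbo.Tick.
-- SQL Server pages are 8 KB, so pages * 8 / 1024 gives megabytes.
SELECT
    i.name AS index_name,
    i.type_desc,
    SUM(ps.used_page_count) * 8 / 1024 AS used_mb
FROM sys.dm_db_partition_stats AS ps
JOIN sys.indexes AS i
    ON i.object_id = ps.object_id
   AND i.index_id = ps.index_id
WHERE ps.object_id = OBJECT_ID('dbo.Tick')
GROUP BY i.name, i.type_desc;
```

Running it before and after creating a covering index on a one-day sample gives a concrete measure of the space cost.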

Include columns that are used very frequently, such as those used to generate summary listings. Columns that are queried infrequently, such as those only needed in detailed views, should be left out of the index to save space.

Potentially helpful reading: Using Covering Indexes to Improve Query Performance

Looking at your most common query, you should create a composite index based first on the columns involved in the WHERE clause:

  trade_date, symbol, price_field

then on the remaining columns in the SELECT list:

  time_stamp, price, size

This way, the index can serve both the WHERE filtering and the SELECT column retrieval, avoiding access to the data table:

  trade_date, symbol, price_field, time_stamp, price, size

In your sequence you have time_stamp before price_field, that is, a SELECT column before a WHERE column. This prevents the database engine from using the full power of the index.
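
A minimal sketch of the index this answer describes, with the WHERE columns leading the key and the remaining SELECT columns after them. The index name IX_Tick_Query is illustrative only:

```sql
-- Column order as described above: WHERE columns first, then SELECT columns.
-- IX_Tick_Query is a hypothetical name.
CREATE NONCLUSTERED INDEX [IX_Tick_Query] ON [dbo].[Tick]
(
    [trade_date] ASC,
    [symbol] ASC,
    [price_field] ASC,
    [time_stamp] ASC,
    [price] ASC,
    [size] ASC
)
```

A possible variation is to move price and size into an INCLUDE clause so they are stored only at the leaf level. Whether that, or a different key order, performs better is worth testing against a sample of your own data.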
