
Statistics/aggregation queries: full scan, index, or partitioning?

Suppose we have data like the following (Oracle syntax, but that is not significant):

create table EVENT (
    UUID raw(16) default sys_guid(),  -- not significant for the question
    TYPE number(2,0) not null,
    DATEX date not null,
    AMOUNT number(18,2) -- aggregated with SUM, COUNT, AVG, STDDEV_POP, MEDIAN, etc.
);

The number of distinct TYPE values is limited to a humanly manageable count (say 20), DATEX covers the last 10 years, and AMOUNT is the field used for statistical analysis (for example, a histogram of AMOUNT by month for a given event TYPE within a selected DATEX period).

The number of rows is about 2e+6.

Since all queries use the restriction TYPE = n and DATEX between DATE 'yyyy-mm-dd' and DATE 'yyyy-mm-dd', I decided to create an index on these fields:

create index INDEX_EVENT_MAIN on EVENT (TYPE ASC, DATEX ASC);

With a full scan, the queries perform about 2 to 5 times better than with the index.

Another strategy is to split the data by event TYPE across separate tables such as EVENT1, EVENT2, and so on. I use these tables without any indexes at all. In this case, queries against an EVENTn table perform 2 to 10 times better than queries against the big EVENT table with TYPE = n (both as full scans).

I also partitioned the EVENT table:

alter table EVENT add partition event_default values (DEFAULT);

alter table EVENT split partition event_default values(2) into (
  partition event2,
  partition event_default);

and query performance for TYPE = 2 became the same as with the separate EVENT2 table.

I am not a DBA expert; I am someone who builds Web 2.0 corporate sites. So I can run experiments and guess, but I don't understand the black box and can't interpret the results in terms of solid relational/algorithmic theory.

So I have some related questions:

  • Do indexes not work for statistical queries (which process a wide range of rows), so that a full scan is better?
  • Are indexes useful only for point queries (non-range, get by ID) rather than wide-range queries?
  • Is table splitting or partitioning the only way to boost performance for statistical/aggregation queries?

I'll answer your questions, but first some background that is needed to understand the answers:

The time taken to perform a full scan is pretty much determined by the throughput of the hard drives on which your data resides. If your disk can deliver 200 MB/s, it will take ~1 second to perform a full table scan of a table with 200 MB of data, regardless of the number of rows.
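As a rough illustration of that rule of thumb, here is a minimal Python sketch. It assumes the scan is purely throughput-bound (no caching, no CPU cost), and the function name is made up for this example:

```python
def full_scan_seconds(table_mb: float, throughput_mb_per_s: float) -> float:
    """Estimate full-scan time as table size divided by sequential throughput."""
    if throughput_mb_per_s <= 0:
        raise ValueError("throughput must be positive")
    return table_mb / throughput_mb_per_s


# The row count never appears in the estimate; only the data volume does.
print(full_scan_seconds(200, 200))    # 200 MB table on a 200 MB/s disk
print(full_scan_seconds(2000, 200))   # 2,000 MB table on the same disk
```

With these inputs the estimates come out to 1 second and 10 seconds respectively.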

Imagine a 200 MB table without any indexes, but where a column ID is unique within the data. In this case both of the queries below will take the same time, because the bulk of the time is spent waiting for the hard drives to hand data to the Oracle process.

The first query will take a long time because of all the data Oracle has to wade through in order to find the row which satisfies id = 1. The second query will take a long time because of all the data Oracle has to wade through in order to aggregate all values for one_column and another_column.

select id, one_column, another_column
  from two_hundred_mb_table
 where id = 1

select sum(one_column) / sum(another_column) 
  from two_hundred_mb_table

If you were to add an index to column ID, everything changes. The first query would now only have to visit the index for ID = 1, pick up the "rowid" (which is the physical address of the row in the data file), request the "block" on disk and then pick out the row. The first query is now a lot faster because of all the data it doesn't have to wade through.

The crucial point here is that even though you have indexed the column, you still can't pick the row directly from disk. You still have to pick up the entire block (typically ~8 KB) from disk. With an average row length of, say, 100 bytes, that block holds roughly 80 rows. So you read about 80 rows in order to find your one row.

This is why you typically can't read a lot of rows via an index before it becomes slower than a table scan. The reason is that you may end up re-reading the same block over and over again. And of course there is a break-even point (which is different in every case) beyond which reading the data via a full table scan becomes faster than via the index.
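To get a feel for where that break-even point might sit, here is a deliberately pessimistic Python sketch. It assumes every row fetched through the index costs one uncached random block read, and the 5 ms per random read is an assumed figure for a spinning disk, not something measured on your system:

```python
def index_read_seconds(rows_wanted: int, random_read_ms: float) -> float:
    """Worst case: each wanted row is in a different block and nothing is cached."""
    return rows_wanted * random_read_ms / 1000.0


def break_even_rows(table_mb: float, throughput_mb_per_s: float,
                    random_read_ms: float) -> int:
    """Number of index-fetched rows whose cost equals one full table scan."""
    scan_seconds = table_mb / throughput_mb_per_s
    return int(scan_seconds * 1000.0 / random_read_ms)


# 200 MB table, 200 MB/s sequential throughput, assumed 5 ms per random read.
rows = break_even_rows(200, 200, 5.0)
print(rows, index_read_seconds(rows, 5.0))
```

Under these assumptions the index stops paying off after only a couple of hundred rows. Real systems do better thanks to the buffer cache and well-clustered data, which is why the actual point differs in every case.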

Now, on to your questions: 现在,关于您的问题:

1. Do indexes not work for statistical queries (which process a wide range of rows), so that a full scan is better? The answer to this lies in the text above. It has nothing to do with sum/count or indexes as such; it has to do with the amount of data in the table, and whether there is an efficient access path into the subset of interest.

2. Are indexes useful only for point queries (non-range, get by ID) rather than wide-range queries? Here too, the answer lies in the text above. You can use range queries on indexes, but again, if the subset of interest is too large, you're better off with a full table scan.

3. Is table splitting or partitioning the only way to boost performance for statistical/aggregation queries?

If the table is 2,000 MB, and your disk can return 200 MB/s, it will take you 10 seconds to perform a full table scan. Assuming uniform data distribution on type and 10 distinct values for it, you could list-partition the table by type. In this case, every partition would be 200 MB, and so any query on type=n would take 1 second instead of 10 seconds. However, all queries without type=n would still take 10 seconds.

You can also range-partition on the datex column, for example with one partition per month. Again assuming that the table is 2,000 MB with uniform data distribution, each partition would hold 1/12 of a year's data, which is roughly 1/120 of the table for the 10 years of data described in the question.

You can also combine these, and partition by LIST(type) and RANGE(datex).
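To put numbers on the combined scheme, here is a small Python sketch. It reuses the 2,000 MB table, 10 types and 200 MB/s from above, assumes 120 monthly partitions (10 years) with perfectly uniform data, and assumes the optimizer prunes to exactly the partitions the predicate needs. The function name is made up for this example:

```python
def pruned_scan_seconds(table_mb: float, n_types: int, n_months: int,
                        months_queried: int, throughput_mb_per_s: float) -> float:
    """Scan time when a query on one type touches only the months it asks for."""
    if not 0 <= months_queried <= n_months:
        raise ValueError("months_queried must be between 0 and n_months")
    # Uniform distribution: each (type, month) subpartition holds an equal share.
    scanned_mb = table_mb * months_queried / (n_types * n_months)
    return scanned_mb / throughput_mb_per_s


# One type over a 6-month window versus one type over all 10 years.
print(pruned_scan_seconds(2000, 10, 120, 6, 200))
print(pruned_scan_seconds(2000, 10, 120, 120, 200))
```

The second call reproduces the 1 second from the list-partition-only case; the first shows how much further date pruning can cut the scanned volume when the window is short.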

If you still cannot meet the performance requirements, you can look into creating aggregate tables (or materialized views). For example, if you find yourself doing a lot of analysis on larger timespans, you can aggregate the data by month, and perform the higher-level queries on the aggregated data. Once you have found a month that you need to "drill down" into, you can query the event table again with a predicate that hits essentially one partition.

Full table scans work better for reading "large" amounts of data; indexes work better for reading "small" amounts of data. Where the boundary lies depends mostly on the index clustering factor and on single-block I/O versus multi-block I/O.

Index Clustering Factor

As Ronnis mentioned, the amount of time spent on I/O depends on the number of blocks read from disk. But reading multiple rows from the same block at one time is usually very cheap: once the block is in memory, scanning through the rows is fast.

The real issue is that, depending on how the data is ordered on disk, reading a small percentage of the rows could require reading a large percentage of the data. Some indexes are inefficient because of the order in which the table data was created.

The index clustering factor is a measure of how well the table data is ordered with respect to the index. The number is an estimate of the "number of I/Os required to read an entire table by means of an index". That number can be found in DBA_INDEXES.CLUSTERING_FACTOR.

You can often optimize an index by rebuilding the table with the rows sorted in a specific way. But that only works for one index.
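The idea behind the clustering factor can be sketched in a few lines of Python. This is a simplified model of how the statistic is commonly described (walk the index in key order and count how often the next row lives in a different table block than the previous one), not Oracle's actual implementation; the block numbers are made-up sample data:

```python
def clustering_factor(block_ids_in_index_order):
    """Count block switches while visiting rows in index key order."""
    factor = 0
    previous = None
    for block in block_ids_in_index_order:
        if block != previous:
            factor += 1
            previous = block
    return factor


# Nine rows stored in three blocks, listed in index key order.
well_ordered = [0, 0, 0, 1, 1, 1, 2, 2, 2]   # table sorted like the index
scattered = [0, 1, 2, 0, 1, 2, 0, 1, 2]      # neighbouring keys in different blocks

print(clustering_factor(well_ordered))
print(clustering_factor(scattered))
```

In this model the well-ordered table scores close to the number of blocks (3), while the scattered one scores close to the number of rows (9), which is the difference between an index that is cheap to range-scan and one that is not.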

Single-Block versus Multi-Block I/O

Oracle can read either one block at a time or multiple blocks at a time.

For reading a single value from a single row, obviously the fastest approach is to read as little data as possible. Access methods like index range scans normally use single-block I/O, while access methods like full table scans and fast full index scans normally use multi-block I/O.

For large amounts of data, reading data in large chunks is much more efficient than reading one block at a time. I can't give you a good explanation of how disk heads seek, read data from sectors, etc. The details are unimportant; there is a general lesson here about database performance: always process in batches.

You may even be able to figure out the single-block vs. multi-block time on your system. If you have system statistics gathered, and they are accurate, you can use these two queries to figure out the time to read a block one at a time vs. in a batch:

--Blocks read per multi-block read, if you set the value yourself.
select value from v$parameter where name = 'db_file_multiblock_read_count';

--Time to read single and multiple blocks, in milliseconds.
--And average blocks per multi-block read.
select * from sys.aux_stats$ where pname in ('SREADTIM', 'MREADTIM', 'MBRC');
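Once you have those three numbers, the comparison is simple arithmetic. The Python sketch below uses made-up sample values (SREADTIM = 5 ms, MREADTIM = 10 ms, MBRC = 8); substitute whatever your own system reports:

```python
def per_block_ms(sreadtim_ms: float, mreadtim_ms: float, mbrc: float):
    """Return (single-block cost, effective per-block cost of a multi-block read)."""
    if mbrc <= 0:
        raise ValueError("MBRC must be positive")
    return sreadtim_ms, mreadtim_ms / mbrc


# Hypothetical statistics, not taken from any real system.
single_ms, multi_ms = per_block_ms(sreadtim_ms=5.0, mreadtim_ms=10.0, mbrc=8.0)
print(single_ms, multi_ms, single_ms / multi_ms)
```

With these sample values a block costs 5 ms when read on its own and 1.25 ms when read as part of a batch, so multi-block reads are about 4 times cheaper per block.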
