How to improve performance of a JOIN of two SCD2 tables in Oracle SQL

I have two tables, both using valid-from / valid-to logic. Table 1 looks like this:

ID | VALID_FROM | VALID_TO 
1  | 01.01.2000 | 04.01.2000
1  | 04.01.2000 | 16.01.2000
1  | 16.01.2000 | 17.01.2000
1  | 17.01.2000 | 19.01.2000
2  | 03.02.2001 | 04.04.2001
2  | 04.04.2001 | 14.03.2001
2  | 14.04.2001 | 18.03.2001

while table 2 looks like this:

ID | VAR | VALID_FROM | VALID_TO 
1  |  3  | 01.01.2000 | 17.01.2000
1  |  2  | 17.01.2000 | 19.01.2000
2  |  4  | 03.02.2001 | 14.03.2001
  • Table 1 has 132,195,791 rows and table 2 has 16,964,846.
  • The valid-from to valid-to interval of every row in table 1 lies within one or more of the valid-from to valid-to windows in table 2.
  • I created primary keys on both tables over (ID, VALID_FROM).
  • I want to do an inner join like:
    select t1.*, 
           t2.var 
      from t1 t1
inner join t2 t2
        on t1.id = t2.id
       and t1.valid_from >= t2.valid_from
       and t1.valid_to <= t2.valid_to;

This join is really slow. I ran it for half a day without success. What can I do to improve performance in this particular case? Please note that I also want to left join the resulting table in later stages. Any help is highly appreciated.

EDIT

Obviously, the information I gave was less than is generally desired here on the platform.

  • I use Oracle Database 12c Enterprise Edition.
  • The example I gave is illustrative of the bigger problem at hand. I am concerned with joining information from different tables with different valid_from / valid_to dates. For this I first created a grid from the distinct values in the valid_from and valid_to columns of all the relevant tables. This grid is what I refer to here as table 1.
  • Results from the execution plan (I adjusted the column and table names to match the terminology used in my illustrative example):
    --------------------------------------------------------------------------------------
    | Id  | Operation          | Name    | Rows  | Bytes |TempSpc| Cost (%CPU)| Time     |
    --------------------------------------------------------------------------------------
    |   0 | SELECT STATEMENT   |         |   465M|    23G|       |   435K  (3)| 00:00:18 |
    |*  1 |  HASH JOIN         |         |   465M|    23G|   695M|   435K  (3)| 00:00:18 |
    |   2 |   TABLE ACCESS FULL| TABLE2  |    16M|   501M|       | 22961   (2)| 00:00:01 |
    |   3 |   TABLE ACCESS FULL| TABLE1  |   132M|  3025M|       |   145K  (2)| 00:00:06 |
    --------------------------------------------------------------------------------------

    Query Block Name / Object Alias (identified by operation id):
    -------------------------------------------------------------

       1 - SEL$58A6D7F6
       2 - SEL$58A6D7F6 / T2@SEL$1
       3 - SEL$58A6D7F6 / T1@SEL$1

    Predicate Information (identified by operation id):
    ---------------------------------------------------

       1 - access("T1"."ID"="T2"."ID")
           filter("T1"."VALID_TO"<="T2"."VALID_TO" AND 
                  "T1"."VALID_FROM">="T2"."VALID_FROM")

    Column Projection Information (identified by operation id):
    -----------------------------------------------------------

       1 - (#keys=1) "T2"."ID"[VARCHAR2,20], 
           "T1"."ID"[VARCHAR2,20], "T1"."VALID_TO"[DATE,7], 
           "T2"."VAR"[VARCHAR2,20], "T2"."VALID_FROM"[DATE,7], 
           "T2"."VALID_TO"[DATE,7], "T1"."ID"[VARCHAR2,20], 
           "T1"."VALID_FROM"[DATE,7], "T1"."VALID_TO"[DATE,7], "T1"."VALID_FROM"[DATE,7]
       2 - "T2"."ID"[VARCHAR2,20], 
           "T2"."VAR"[VARCHAR2,20], "T2"."VALID_FROM"[DATE,7], 
           "T2"."VALID_TO"[DATE,7]
       3 - "T1"."ID"[VARCHAR2,20], "T1"."VALID_FROM"[DATE,7], 
           "T1"."VALID_TO"[DATE,7]

    Note
    -----
       - this is an adaptive plan

A good practice is to ask first: what do you expect the query to return?

Based on your join predicate, it seems you are interested in all versions from table2 whose validity interval fully contains the validity interval of the table1 row. This may be intentional, but more commonly you need all versions whose intervals intersect between the tables.
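
For comparison, here is a sketch of the intersecting variant. It assumes half-open intervals, where the VALID_TO of one version equals the VALID_FROM of the next, as the sample data suggests; with closed intervals the comparisons would be <= and >= instead.

```sql
-- Overlap join: returns every t2 version whose interval intersects the t1 interval.
-- Assumption: intervals are half-open, [valid_from, valid_to).
select t1.*,
       t2.var
  from t1
 inner join t2
    on t1.id = t2.id
   and t1.valid_from < t2.valid_to
   and t1.valid_to   > t2.valid_from;
```

Since table 1 is described as a grid built from all distinct boundary dates, both variants may well return the same rows in this case; the difference matters when intervals only partially overlap.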

The second question is whether you need to see only the first few rows or all rows of the join.

If you only want to see a few results, simply add a predicate such as t1.ID = nnnn in a WHERE clause to limit the query to some sample ID. If you have proper indexes (and there is not an extreme number of rows with this ID), you will get the result quickly, as a NESTED LOOPS join will kick in.
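
A minimal sketch of such a sample query follows. The value '1' is only a placeholder taken from the illustrative data, written as a string because the plan in the question reports ID as VARCHAR2.

```sql
-- Sanity check on a single ID before running the full join.
-- '1' is a placeholder; replace it with an ID that exists in both tables.
select t1.*,
       t2.var
  from t1
 inner join t2
    on t1.id = t2.id
   and t1.valid_from >= t2.valid_from
   and t1.valid_to   <= t2.valid_to
 where t1.id = '1';
```

With the primary keys on (ID, VALID_FROM) in place, this kind of query is expected to use index access rather than full scans.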

To produce the full result, you must process all rows from both tables. No index will help you select all rows from a table; here a FULL TABLE SCAN is the best option.

To join large row sets, the best approach is a HASH JOIN. NESTED LOOPS (which you are probably using now) are quick at joining a few rows but stall on large row sets.

The smaller table (table2) is read into memory (hopefully) as a hash table. The larger table (table1) is then probed against this hash table to perform the join.
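
If the optimizer does not pick this plan on its own, optimizer hints can request it. The following is only a sketch: the optimizer may already choose a hash join with full scans, in which case the hints change nothing.

```sql
-- Request full scans of both tables and a hash join with t2 first in the join order.
-- The hints only state the intended plan; verify the result in the execution plan.
select /*+ leading(t2 t1) use_hash(t1) full(t2) full(t1) */
       t1.*,
       t2.var
  from t1
 inner join t2
    on t1.id = t2.id
   and t1.valid_from >= t2.valid_from
   and t1.valid_to   <= t2.valid_to;
```

Note that the optimizer may still swap the build and probe inputs of the hash join; whether that happens is visible in the order of the two TABLE ACCESS FULL lines in the plan.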

This is the execution plan you should look for:

-----------------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes |TempSpc| Cost (%CPU)| Time     |
-----------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |    10T|   399T|       |   190M(100)| 02:03:47 |
|*  1 |  HASH JOIN         |      |    10T|   399T|   550M|   190M(100)| 02:03:47 |
|   2 |   TABLE ACCESS FULL| SCD2 |    16M|   355M|       |    39  (93)| 00:00:01 |
|   3 |   TABLE ACCESS FULL| SCD1 |   132M|  2395M|       |   211  (99)| 00:00:01 |
-----------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - access("T1"."ID"="T2"."ID")
       filter("T1"."VALID_FROM">="T2"."VALID_FROM" AND 
              "T1"."VALID_TO"<="T2"."VALID_TO")
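
To check which plan your own query gets without executing it, one common approach is EXPLAIN PLAN together with DBMS_XPLAN, sketched here for the query from the question:

```sql
-- Write the estimated plan to PLAN_TABLE without running the join.
explain plan for
select t1.*,
       t2.var
  from t1
 inner join t2
    on t1.id = t2.id
   and t1.valid_from >= t2.valid_from
   and t1.valid_to   <= t2.valid_to;

-- Display the most recently explained plan; 'ALL' adds the predicate,
-- alias and column projection sections.
select * from table(dbms_xplan.display(null, null, 'ALL'));
```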

Provided you are on an Enterprise Edition database, this should take you from days to hours. You can further use the parallel option to get additional speedup.
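
As a sketch of the parallel option, a statement-level PARALLEL hint can be added to the query. The degree of 8 is an arbitrary illustration, not a recommendation; a suitable value depends on the CPUs and I/O capacity available.

```sql
-- Run the join with a requested degree of parallelism of 8 (illustrative value).
select /*+ parallel(8) */
       t1.*,
       t2.var
  from t1
 inner join t2
    on t1.id = t2.id
   and t1.valid_from >= t2.valid_from
   and t1.valid_to   <= t2.valid_to;
```

Whether the requested degree is actually granted depends on the instance's parallel execution settings.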

Good luck!
