Joining tables from two Oracle databases in SAS

I am joining two tables that are located in two separate Oracle databases.

I am currently doing this in SAS by creating a libname connection to each database and then simply using something like the below.

libname dbase_a oracle user= etc... ;
libname dbase_b oracle user= etc... ;

proc sql;
create table t1 as 

select a.*, b.*
from dbase_a.table1 a inner join dbase_b.table2 b
on a.id = b.id;
quit;

However, the query is painfully slow. Can you suggest any better options to speed up such a query (short of going down the path of creating a database link)?

Many thanks for looking at this.

If those two databases are on the same server and you are able to execute cross-database queries in Oracle, you could try using SQL pass-through:

proc sql;
connect to oracle (user= password= <...>);
create table t1 as
select * from connection to oracle (
  select a.*, b.*
  from dbase_a.schema_a.table1 a
  inner join dbase_b.schema_b.table2 b
    on a.id = b.id
);
disconnect from oracle;
quit;

I think that, in most cases, SAS attempts as much as possible to have the query executed on the database server, even if pass-through was not explicitly specified. However, when the query references tables on different servers, or in different databases on a system that does not allow cross-database queries, or when it contains SAS-specific functions that SAS cannot translate into something valid for the DBMS, SAS will indeed resort to 'downloading' the complete tables and processing the query locally, which can be painfully inefficient.
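
One way to check which of the two behaviours you are getting is to turn on SAS/ACCESS tracing and read the log. The sketch below assumes the dbase_a and dbase_b libnames from the question are already assigned; it only reports what SAS sends to Oracle and does not change how the query runs.

```sas
/* Write the SQL that SAS/ACCESS sends to the DBMS into the SAS log. */
options sastrace=',,,d' sastraceloc=saslog nostsuffix;

proc sql;
    create table t1 as
    select a.*, b.*
    from dbase_a.table1 a inner join dbase_b.table2 b
    on a.id = b.id;
quit;

/* Switch tracing off again once the log has been inspected. */
options sastrace=off;
```

If the log shows two separate full-table SELECT statements, one per connection, rather than a single join, the join is being evaluated in SAS.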

The select is for all columns from each table, and the inner join is on the id values only. Because the join criteria are evaluated on data coming from disparate sources, the baggage of all columns could be a big factor in the timing: even non-matching rows must be downloaded (by the libname engine, within the SQL execution context) during the ON evaluation.

One approach would be to:

  • Select only the id from each table
  • Find the intersection
  • Upload the intersection to each server (as a scratch table)
  • Use the intersection on each server as pass-through selection criteria within the final join in SAS

There are a couple of variations depending on the expected number of id matches, the number of different ids in each table, or knowing table-1 and table-2 as SMALL and BIG. For a large number of id matches that need to be transferred back to a server you will probably want to use some form of bulk copy. For a relatively small number of ids in the intersection you might get away with enumerating them directly in a SQL statement using the construct IN (). The size of a SQL statement could be limited by the database, the SAS/ACCESS to ORACLE engine, or the SAS macro system.
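
A minimal sketch of the IN () variation, under stated assumptions: a WORK table named INTERSECTION already holds the matching ids, the ids are numeric, a libname SOURCE1 points at the first database, and T1_MATCH is a hypothetical output name. Oracle rejects an IN list with more than 1000 expressions, and a SAS macro variable is capped at 65,534 characters, so this only suits small intersections.

```sas
/* Assumptions: WORK.INTERSECTION exists with a numeric id column,
   and libname SOURCE1 is already assigned. */
proc sql noprint;
    /* put() avoids the default BEST8. rendering of long numeric ids */
    select distinct strip(put(id, 20.))
        into :idlist separated by ','
    from INTERSECTION;
quit;

proc sql;
    connect using SOURCE1 as PASS1;
    /* The macro variable resolves before the text is sent to Oracle */
    create table T1_MATCH as
    select * from connection to PASS1 (
        select * from table1 where id in (&idlist)
    );
    disconnect from PASS1;
quit;
```

The same pattern would be repeated against the second database, and the two filtered extracts joined in SAS.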

Consider a data scenario in which it has been determined that the potential number of matching ids would be too large for a construct in (id-1,...id-n). In such a case the list of matching ids is dealt with in a tabular manner:

libname SOURCE1 ORACLE ....;
libname SOURCE2 ORACLE ....;

libname SCRATCH1 ORACLE ... must specify a scratch schema ...;
libname SCRATCH2 ORACLE ... must specify a scratch schema ...;

proc sql;
    connect using SOURCE1 as PASS1;
    connect using SOURCE2 as PASS2;

    * compute intersection from only id data sent to SAS;
    create table INTERSECTION as
    (select id from connection to PASS1 (select id from table1))
    intersect
    (select id from connection to PASS2 (select id from table2))
    ;

    * upload intersection to each server;
    create table SCRATCH1.ids as select id from INTERSECTION;
    create table SCRATCH2.ids as select id from INTERSECTION;

    * compute inner join from only data that matches intersection;
    create table INNERJOIN as select ONE.*, TWO.* from
    (select * from connection to PASS1 (
        select * from oracle-path-to-schema.table1 
        where id in (select id from oracle-path-to-scratch.ids)
    )) as ONE
    JOIN
    (select * from connection to PASS2 (
        select * from oracle-path-to-schema.table2
        where id in (select id from oracle-path-to-scratch.ids)
    )) as TWO
    on ONE.id = TWO.id;
    ...
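
For a large intersection, the upload step in the code above could use the bulk-load facility of SAS/ACCESS to Oracle instead of row-by-row inserts. This is a sketch only: it assumes the SCRATCH1 and SCRATCH2 libnames are assigned as shown, that the INTERSECTION table exists in WORK, and that SQL*Loader is available to the SAS session, which BULKLOAD=YES relies on.

```sas
/* Bulk-load the matching ids into each scratch schema.
   Assumption: SQL*Loader is installed where SAS runs. */
data SCRATCH1.ids (bulkload=yes);
    set INTERSECTION (keep=id);
run;

data SCRATCH2.ids (bulkload=yes);
    set INTERSECTION (keep=id);
run;

/* Run this only after the final join has completed:
   remove the scratch tables so they do not linger. */
proc sql;
    drop table SCRATCH1.ids;
    drop table SCRATCH2.ids;
quit;
```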

For the case of both table-1 and table-2 having very large numbers of ids that exceed the resource capacity of your SAS platform, you will also have to iterate the approach over ranges of id counts. Techniques for determining the range criteria for each iteration are a tale for another day.
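
Purely as an illustration of what such an iteration might look like, the sketch below walks fixed-width ranges of a numeric id and accumulates the intersection chunk by chunk. The macro name intersect_by_range, its parameters, and the CHUNK table are hypothetical, the SOURCE1 and SOURCE2 libnames are assumed to be assigned, and fixed-width ranges are an assumption that only works well when ids are numeric and fairly evenly spread.

```sas
%macro intersect_by_range(lo=, hi=, step=);
    %local start stop;
    %let start = &lo;
    %do %while (&start <= &hi);
        /* Upper bound of this chunk, clipped to the overall maximum */
        %let stop = %sysfunc(min(%eval(&start + &step - 1), &hi));

        proc sql;
            connect using SOURCE1 as PASS1;
            connect using SOURCE2 as PASS2;
            /* Intersection of ids restricted to the current range */
            create table CHUNK as
            (select id from connection to PASS1
                (select id from table1 where id between &start and &stop))
            intersect
            (select id from connection to PASS2
                (select id from table2 where id between &start and &stop));
            disconnect from PASS1;
            disconnect from PASS2;
        quit;

        /* Accumulate the per-range results */
        proc append base=INTERSECTION data=CHUNK;
        run;

        %let start = %eval(&stop + 1);
    %end;
%mend intersect_by_range;

/* Example call with made-up bounds */
%intersect_by_range(lo=1, hi=1000000, step=100000);
```

Each chunk could equally be pushed through the upload and join steps on its own rather than being appended first, if the full intersection is itself too large to hold in SAS.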
