How to optimize proc sql join in SAS?

I have 2 datasets: ta with 390K rows and 1 variable, and tb with 60 million rows and 350 variables. I need to join these datasets quickly, but my query is too slow.

How can I optimize the query?

My query:

    proc sql;
    create table с as 
    select distinct a.REP_CLID, b.REP_DATE, &Score_Column, b.REP_AGE as AGE 
    from a (IDXWHERE =Yes) ,
    &b (IDXWHERE =Yes)  
    where a.rep_clid = b.rep_clid 

Since you already have an index on rep_clid in your large table b, this seems like a good candidate for a data step key merge. Tweak as required so that you keep only the variables of interest:

data c;
  set a;
  set b key = rep_clid; /*requires unique index on rep_clid to work properly*/
  if _IORC_ then do;
    _ERROR_ = 0;
    delete;
  end;
run;

That will return only records with rep_clid present in both a and b. You can then deduplicate via proc sort with the nodupkey option.

If you have a non-unique index on b, it can still be made to work, but the syntax is a bit more complex:
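As a minimal sketch of that deduplication step (the output name c_dedup and the keep list are assumptions for illustration, not part of the original answer), sorting by every kept variable with nodupkey drops fully duplicated rows, which roughly mirrors the select distinct in the question:

```sas
/* Sketch: drop fully duplicated rows from the key-merge output c.   */
/* Assumption: only these variables are needed; add the variable     */
/* named by &Score_Column to the keep list as required.              */
proc sort data=c (keep=rep_clid rep_date rep_age)
          out=c_dedup   /* hypothetical output data set name */
          nodupkey;
  by _all_;             /* every kept variable is part of the sort key */
run;
```

Restricting the keep list matters here, because with 350 variables coming from b a sort on _all_ would otherwise be far more expensive than necessary.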

data c;
  set a;
  do until(eof);
    set b key = rep_clid end = eof; /*will work with non-unique index on rep_clid*/
    if _IORC_ then do;
      _ERROR_ = 0;
      delete;
    end;
    else output;
  end;
run;

Performance can be a tricky issue, as a huge number of factors can affect it. It is often a case of trying several ways of achieving the same thing until you find the best-performing method.

Are both of these tables SAS tables, or are one or both third-party DBMS tables? That would open up a whole world of performance issues, which I will leave until confirmed.

Assuming they are both SAS tables, and you only want columns from table b, try rewriting your query like this. It assumes &Score_Column is in table b; if it is not, this will not work.

proc sql;
    create table c as
    select distinct b.REP_CLID, b.REP_DATE, &Score_Column, b.REP_AGE as AGE
    from   &b (IDXWHERE=Yes) as b
    where  b.rep_clid in
           (   select a.rep_clid
               from   a (IDXWHERE=Yes)
           )
    ;
quit;

Alternatively, you could use PROC FORMAT, as has been suggested. This example will work if &Score_Column is in table a, but it could easily be modified if it is not.

/* Assumption: rep_clid is a character variable, hence the $ format name. */
proc sql;
    create table rep_clid_fmt as
    select distinct
            '$rep_clid_fmt' as fmtname
        ,   rep_clid        as start
        /* If &Score_Column is in table a, use it as the label (it must be character) */
        ,   &Score_Column   as label
        /* otherwise replace the line above with a flag, for example:
        ,   'keep'          as label */
    from    a
    ;
quit;

proc format cntlin=rep_clid_fmt;
run;

proc sql;
    create table c as
    select distinct b.REP_CLID
        ,   b.REP_DATE
        ,   put(b.REP_CLID, $rep_clid_fmt.) as &Score_Column
        ,   b.REP_AGE as AGE
    from    &b (IDXWHERE=Yes) as b
    /* An unmatched rep_clid is returned unformatted, so it equals its own prefix and is dropped */
    where   put(b.REP_CLID, $rep_clid_fmt.)
            ne substr(b.REP_CLID, 1, length(put(b.REP_CLID, $rep_clid_fmt.)))
    ;
quit;

Good luck!
