Optimize SAS Proc SQL query

I have two large tables that I am trying to join so that I can group the records of the first table by a field from the second. The left table has approx. 50 million event records; the right table has approx. 35 million records of monthly intervals. The monthly intervals are at subjID level, so I cannot reduce the size of the right table by keeping only start and end dates. Currently the join takes about 40-60 minutes.

I tried creating simple indexes on subjID, eventDate, startDate and endDate, but that did not seem to improve performance (creating the indexes took about 5 minutes, and the join then completed in 38 minutes).

Is there any other option I could use to improve processing?

Left table of events at subjID level:

data eventsTable;
input @1 subjID 8.
    @10 eventDate date9.;
format eventDate mmddyy10.;
datalines;
101      01AUG2011
101      28AUG2011
101      30AUG2011
101      01SEP2011
101      12SEP2011
101      28SEP2011
102      01JAN2015
102      15JAN2015
102      01FEB2015
102      16FEB2015
;
run;

Right table of monthly intervals at subjID level. I am trying to bring endDate onto each event that occurred between the start and end date:

data monthlyTable;
input @1 subjID 8.
    @10 startDate date9. 
    @22 endDate date9.;
format startDate endDate mmddyy10.;
datalines;
101      28JUL2011   30AUG2011
101      30AUG2011   28SEP2011
101      28SEP2011   28OCT2011
102      01DEC2014   02JAN2015
102      02JAN2015   02FEB2015
102      02FEB2015   02MAR2015
;
run;

Output:

proc sql;
create table wantTable as 
    select a.*,
        endDate as monthlyDate
    from eventsTable a left join monthlyTable b on 
        a.subjID = b.subjID
    where a.eventDate > b.startDate and a.eventDate <= b.endDate
        order by subjID, eventDate;
quit;

If you have enough memory and you only need endDate from monthlyTable, you might find that a format merge is a more efficient way of doing this. However, if both datasets are large, there is only so much optimisation you can hope for, as you always have to do at least one full read of each.

data t_format(keep = fmtname--hlo) /view = t_format;
  set monthlytable(keep = subjID startdate enddate) end = eof;
  retain fmtname 'myinfmt' type 'i';
  length start end $18; /*Increase for IDs longer than 8 digits*/
  start = cats(put(subjID,z8.),put(startdate + 1,yymmdd10.));
  end   = cats(put(subjID,z8.),put(enddate,yymmdd10.));
  label = enddate;
  output;
  if eof then do;
    hlo = 'O';
    label = .N;
    output;
  end;
run;

proc format cntlin = t_format;
run;

data want;
  set eventstable;
  enddate = input(cats(put(subjID,z8.),put(eventdate,yymmdd10.)),myinfmt18.);
  format enddate yymmdd10.;
run;

Note the use of the yymmdd10. and z8. formats: these ensure that keys are always the same length, avoiding ambiguity, and that the ranges of lookup values are correctly specified in ascending order when creating the numeric informat myinfmt. Strictly speaking, I suppose this is an informat merge rather than a format merge, but it's the same sort of idea.

If you want to return multiple lookup variables via this approach, you'll need to concatenate them when defining the format and then split them apart after applying it.

I would estimate that this approach requires about 1.5GB of memory for the datasets you've specified, i.e. (18 bytes x 2 per date range + 8 bytes for the formatted value) x 35m rows. This may differ a bit depending on the length of your IDs.
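As a quick check of that arithmetic, here is a minimal sketch that writes the estimate to the log. The variable names are made up for illustration, and the figure ignores any per-entry overhead SAS adds when it loads the informat, so treat it as a rough lower bound rather than a measured number.

```sas
data _null_;
  bytes_per_range = 18 * 2 + 8;   /* 18-byte start key + 18-byte end key + 8-byte numeric label */
  n_ranges        = 35e6;         /* approx. number of rows in monthlyTable */
  total_bytes     = bytes_per_range * n_ranges;
  put total_bytes= comma16.;      /* roughly 1.5 billion bytes */
run;
```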

If you need multiple lookup values then you can do a similar thing using a hash merge, but I suspect the format merge is more efficient in this case.

One possible hash merge approach looks like this (the view uses BY processing, so monthlyTable must already be sorted by subjID):

data t_lookup /view= t_lookup;
  set monthlytable;
  by subjID;
  if first.subjID then id_range_count = 0;
  id_range_count + 1;
run;

data want;
  set eventstable;
  if _n_ = 1 then do;
    if 0 then set monthlytable(keep = subjID startdate enddate); /*Add extra lookup vars here as needed*/
    declare hash h(dataset:"t_lookup");
    rc = h.definekey("subjID","id_range_count");
    rc = h.definedata("startdate","enddate"); /*Add extra lookup vars here as needed*/
    rc = h.definedone();
  end;
  match = 0;
  rc    = 0;
  do id_range_count = 1 by 1 while(rc = 0 and match = 0);
    rc = h.find();
    match = (rc = 0) and (startdate < eventdate <= enddate); /*Only test the range if the lookup succeeded; a failed find leaves stale values*/
  end;
  if match = 0 then call missing(startdate,enddate);
  drop rc match id_range_count;
run;

The best index for your query is a composite index on monthlyTable(subjID, startDate, endDate). However, I'm not sure it will give a big performance improvement in SAS.
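For reference, a composite index like that could be created along these lines. This is only a sketch: it assumes monthlyTable lives in the WORK library, and the index name subjDates is a placeholder I picked, not something from the question.

```sas
proc datasets library=work nolist;
  modify monthlyTable;
  /* composite index: subjID first, then the two date columns */
  index create subjDates = (subjID startDate endDate);
quit;
```

Building the index is itself a full pass over the table, so include that time when comparing against the unindexed join.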

I've had better luck with pre-sorting data sets than with creating indexes. However, pre-sorting can take a long time depending on the size of the data sets and what you are sorting them on. It can take longer than the original SQL query, so testing becomes important.

Try running

PROC SORT DATA=eventsTable ;
  BY subjID eventDate ;
RUN ;

PROC SORT DATA=monthlyTable ;
  BY subjID startDate endDate ;
RUN ;

before your PROC SQL. The only explanation I have is that SAS recognizes the sort-order information stored in the data set header and doesn't need to scan entire tables looking for matches, since a given subjID would probably only be on a few consecutive pages. Being on a few consecutive pages also decreases I/O.
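If you want to see whether the sort actually changes how the join is executed, one option is to ask PROC SQL to report its plan. The sketch below reruns the query from the question with the _METHOD option, which writes the chosen join method to the log, and with MSGLEVEL=I, which adds informational notes such as index usage. Exactly what appears in the log depends on your SAS version, so take this as a starting point for testing rather than a guaranteed diagnostic.

```sas
options msglevel=i;   /* ask for extra informational notes in the log */

proc sql _method;     /* write the join method chosen by the planner to the log */
create table wantTable as
    select a.*,
        endDate as monthlyDate
    from eventsTable a left join monthlyTable b on
        a.subjID = b.subjID
    where a.eventDate > b.startDate and a.eventDate <= b.endDate
        order by subjID, eventDate;
quit;
```

Comparing the log and the elapsed time with and without the two PROC SORT steps should show whether pre-sorting pays for itself on your data.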
