SAS - Data Step equivalent of Proc SQL

What would be the data step equivalent of this proc sql?

proc sql;
  create table issues2 as
    select request,
           area,
           sum(issue_count) as issue_count,
           sum(resolved_count) as resolved_count
    from issues1
    group by request, area;
quit;

PROC MEANS/SUMMARY is better, but if it's relevant, the actual data step solution is as follows. Basically you just reset the counter to 0 on first.<var> and output on last.<var>, where <var> is the last variable in your BY group.

Note: This assumes the data is sorted by request and area. Sort it first if it is not.
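If it isn't already in that order, a simple sort along these lines (a sketch, using the dataset name from the question) will prepare it:

proc sort data=issues1;
  by request area;
run;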

data issues2(rename=(issue_count_sum=issue_count resolved_count_sum=resolved_count)
             drop=issue_count resolved_count); /* drop= is applied before rename=, so the _sum variables take over the original names */
 set issues1;
 by request area;
 if first.area then do;              /* reset the running totals at the start of each request/area group */
   issue_count_sum=0;
   resolved_count_sum=0;
 end;
 issue_count_sum+issue_count;        /* sum statements: the totals are retained across rows */
 resolved_count_sum+resolved_count;
 if last.area then output;           /* write one summary row per group */
run;

The functional equivalent of what you're trying to do is the following:

data _null_;
  set issues1(rename=(issue_count=_issue_count
                      resolved_count=_resolved_count)) end=done;

  if _n_=1 then do;
    /* set up a hash object keyed by request/area on the first iteration */
    declare hash total_issues();
    total_issues.defineKey("request", "area");
    total_issues.defineData("request", "area", "issue_count", "resolved_count");
    total_issues.defineDone();
  end;

  /* find() ne 0 means this request/area key isn't in the hash yet: start new totals */
  if total_issues.find() ne 0 then do;
    issue_count = _issue_count;
    resolved_count = _resolved_count;
  end;
  else do;
    /* key found: find() has loaded the running totals, so add this row's counts */
    issue_count + _issue_count;
    resolved_count + _resolved_count;
  end;

  total_issues.replace();  /* store the updated totals back into the hash */

  if done then total_issues.output(dataset: "issues2");  /* write the hash out on the last row */
run;

This method does not require you to pre-sort the dataset. I wanted to see what kind of performance you'd get with the different methods, so I did a few tests on a 74M row dataset. I got the following run-times (your results may vary):

Unsorted Dataset:

  • Proc SQL - 12.18 Seconds
  • Data Step With Hash Object Method (above) - 26.68 Seconds
  • Proc Means using a class statement (nway) - 5.13 Seconds

Sorted Dataset (36.94 Seconds to do a proc sort):

  • Proc SQL - 10.82 Seconds
  • Proc Means using a by statement - 9.31 Seconds
  • Proc Means using a class statement (nway) - 6.07 Seconds
  • Data Step using by statement (I used the code from Joe's answer) - 8.97 Seconds

As you can see, I wouldn't recommend using the data step with the hash object method shown above since it took twice as long as the proc sql.

I'm not sure why proc means with a by statement took longer than proc means with a class statement, but I tried this on a bunch of different datasets and saw similar differences in runtimes (I'm using SAS 9.3 on Linux 64).
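For reference, the by-statement variant from those timings would look roughly like this (a sketch, not necessarily the exact code used for the tests; it assumes issues1 has already been sorted by request and area):

proc means data=issues1 noprint;
  by request area;
  var issue_count resolved_count;
  output out=issues2(drop=_:) sum=;
run;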

Something to keep in mind is that these runtimes might be completely different in your situation, but I would recommend using the following code to do the summation:

proc means data=issues1 noprint nway;
  class request area;
  var issue_count resolved_count;
  output out=issues2(drop=_:) sum=;
run;

Awkward, I think, to do it in a data step at all - summing and resetting variables at each level of the by variables would work. A hash object might also do the trick.

Perhaps the simplest non-Proc SQL method would be to use Proc Summary:

proc summary data = issues1 nway missing;
  class request area;
  var issue_count resolved_count;
  output out = issues2 sum(issue_count) = issue_count
                       sum(resolved_count) = resolved_count;
run;

Here's the temporary array method. This is the "simplest" of them, making some assumptions about the request and area values; if those assumptions are faulty, as they often are in real data, it may not be quite as easy as this. Note that while the data below happens to be sorted, I don't rely on it being sorted, and the process doesn't gain any advantage from it being sorted.

data issues1;
do request=1 to 1e5;
  do area = 1 to 7;
    do issueNum = 1 to 1e2;
      issue_count = floor(rand('Uniform')*7);
      resolved_count = floor(rand('Uniform')*issue_count);
      output;
    end;
  end;
end;
run;

data issues2;
set issues1 end=done;
array ra_issue[1100000] _temporary_;
array ra_resolved[1100000] _temporary_;
*array index = request||area, so request 9549 area 6 = 95496.;
ra_issue[input(cats(request,area),best7.)] + issue_count;
ra_resolved[input(cats(request,area),best7.)] + resolved_count;
if done then do;
  do _t = 1 to dim(ra_issue);
    if not missing(ra_issue[_t]) then do;
      request = floor(_t/10);
      area    = mod(_t,10);
      issue_count=ra_issue[_t];
      resolved_count=ra_resolved[_t];
      output;
      keep request area issue_count resolved_count;
    end;
  end;
end;
run;

That performed comparably to PROC MEANS with CLASS, given the simple data I started it with. If you can't trivially generate a key from a combination of area and request (if they're character variables, for example), you would have to store another array of name-to-key relationships, which would make it quite a lot slower if there are a lot of combinations (though if there are relatively few combinations, it's not necessarily all that bad). If for some reason you were doing this in production, I would first create a table of unique request+area combinations, create a Format and an Informat to convert back and forth from a unique key (which should be very fast AND give you a reliable index), and then do this using that format/informat rather than the cats / division-modulus approach I use here, roughly as sketched below.
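That production-style variant isn't shown in the answer, so here is a rough sketch of what the data-driven informat could look like; the names combos, combo_fmt and combokey are illustrative, not from the original answer, and you would still need to translate the key back to request/area when writing the output rows:

/* Sketch only: build a numeric informat that maps each unique "request|area"
   combination to a small sequential key, usable as a temporary-array index. */

/* one row per unique request/area combination */
proc sort data=issues1(keep=request area) out=combos nodupkey;
  by request area;
run;

/* control dataset for PROC FORMAT: "request|area" -> sequential key */
data combo_fmt;
  set combos;
  length fmtname $32 type $1 start $60 label $12;
  fmtname = 'combokey';          /* name of the informat to create           */
  type    = 'I';                 /* 'I' = numeric informat                   */
  start   = catx('|', request, area);
  label   = cats(_n_);           /* the key: row number in the sorted combos */
  keep fmtname type start label;
run;

proc format cntlin=combo_fmt;
run;

/* In the summing step the array index would then be something like
     key = input(catx('|', request, area), combokey.);
   and the combos dataset (whose row order matches the key) or a companion
   format can translate keys back to request/area for the output rows. */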
