
Subset SAS dataset with select values from proc freq

Currently I have a file (n obs = 100,000) where the main identifiers are like:

ID  Group
1   a
2   b
3   a
4   c
5   b
6   d
7   d

What I would like to do is create a subset of this data file. Using PROC FREQ I have identified the ten largest groups. Is there an easier way to subset the data than hard-coding which observations to keep? My current attempt is below (where numid = count of id by group):

proc freq data=have order=freq;
    table group;
    where numid > 7;
run;

Thanks!

If you want a table with the records from the groups that rank in the top 10 by frequency, this is straightforward: run PROC RANK on the PROC FREQ output and join the result to the master table. (You could pick out the top 10 counts by hand in PROC SQL, but this seems faster, as PROC RANK is quick and has options for breaking ties and the like.)

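/* Simulate 1,000 records with a random one-letter group (A-Z). */
data have;
  call streaminit(7);
  do id = 1 to 1000;
    group = byte(ceil(rand('Uniform')*26)+64);
    output;
  end;
run;

/* Count records per group; COUNT and PERCENT go to group_counts. */
proc freq data=have;
  tables group / out=group_counts;
run;

/* Rank groups by count, largest first (rank 1 = most frequent). */
proc rank data=group_counts out=ranks descending;
  var count;
  ranks rank;
run;

/* Keep only the records whose group ranks in the top 10. */
proc sql;
  create table want as
    select H.*
    from have H, ranks R
    where H.group=R.group
        and R.rank le 10;
quit;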
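
For comparison, here is a rough sketch of the "do it all in PROC SQL" route mentioned above. It assumes the same `have` table; the names `top_groups`, `want_sql` and `n` are made up for illustration. Note that OUTOBS= simply stops after ten rows, so groups tied at the cutoff are dropped arbitrarily, which is one reason PROC RANK may be the more comfortable choice.

```sas
/* Sketch: find the 10 most frequent groups with PROC SQL alone.   */
/* OUTOBS=10 caps the output at ten rows after the ORDER BY sort.  */
proc sql outobs=10;
  create table top_groups as
    select H.group, count(*) as n
    from have H
    group by H.group
    order by n desc;
quit;

/* Keep the records whose group appears in that top-10 list. */
proc sql;
  create table want_sql as
    select H.*
    from have H
    where H.group in (select T.group from top_groups T);
quit;
```

SAS will normally write a warning to the log that the first statement stopped early because of OUTOBS=; that is expected here.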

At 100k observations this should be fast enough. If you're in too-big-too-slow data territory, you should instead convert the RANK output into a format so you don't have to do a join (and can just reuse that format to subset the next time you use the data).
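
A minimal sketch of that format idea, assuming the `ranks` table produced above. The format name `TOPGRP`, the `Yes`/`No` labels and the dataset names `fmt_in` and `want_fmt` are hypothetical, and the `$8` length for `start` is a guess that you would adjust to the real length of `group`.

```sas
/* Build a CNTLIN control dataset: one row per top-10 group,        */
/* plus a catch-all OTHER row so every other group maps to 'No'.    */
data fmt_in;
  set ranks(where=(rank le 10)) end=last;
  length fmtname $8 type $1 start $8 label $3 hlo $1;
  fmtname = 'TOPGRP';   /* hypothetical format name                 */
  type    = 'C';        /* character format                         */
  start   = group;
  label   = 'Yes';
  hlo     = ' ';
  output;
  if last then do;
    hlo   = 'O';        /* O = OTHER: applies to all unlisted values */
    label = 'No';
    output;
  end;
  keep fmtname type start label hlo;
run;

/* Create the $TOPGRP. format from the control dataset. */
proc format cntlin=fmt_in;
run;

/* Subset without a join: keep rows whose group formats to 'Yes'. */
data want_fmt;
  set have;
  where put(group, $topgrp.) = 'Yes';
run;
```

By default the format is stored in the WORK catalog and disappears at the end of the session; to reuse it later you would point PROC FORMAT at a permanent library with the LIBRARY= option.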


 