
SAS - Replicate multiple observations across rows

I have a data structure that looks like this:

DATA have ; 
INPUT famid indid implicate imp_inc; 
CARDS ; 
1 1 1 40000
1 1 2 25000
1 1 3 34000
1 1 4 23555
1 1 5 49850
1 2 1 1000
1 2 2 2000
1 2 3 3000
1 2 4 4000
1 2 5 5000
1 3 1 .
1 3 2 .
1 3 3 .
1 3 4 .
1 3 5 .
2 1 1 40000
2 1 2 45000
2 1 3 50000
2 1 4 34000
2 1 5 23500
2 2 1 .
2 2 2 .
2 2 3 .
2 2 4 .
2 2 5 .
2 3 1 41000
2 3 2 39000
2 3 3 24000
2 3 4 32000
2 3 5 53000
RUN ;

So, we have family id, individual id, implicate number and imputed income for each implicate.

What I need is to copy the values of the first individual in each family (all five implicates) to the remaining individuals within that family, replacing whatever values those cells previously held, like this:

DATA want ; 
INPUT famid indid implicate imp_inc; 
CARDS ; 
1 1 1 40000
1 1 2 25000
1 1 3 34000
1 1 4 23555
1 1 5 49850
1 2 1 40000
1 2 2 25000
1 2 3 34000
1 2 4 23555
1 2 5 49850
1 3 1 40000
1 3 2 25000
1 3 3 34000
1 3 4 23555
1 3 5 49850
2 1 1 40000
2 1 2 45000
2 1 3 50000
2 1 4 34000
2 1 5 23500
2 2 1 40000
2 2 2 45000
2 2 3 50000
2 2 4 34000
2 2 5 23500
2 3 1 40000
2 3 2 45000
2 3 3 50000
2 3 4 34000
2 3 5 23500
RUN ;

In this example I'm trying to replicate only one variable but in my project I will have to do this for dozens of variables.

So far, I came up with this solution:

%let implist_1=imp_inc;

%macro copyv1(list);
    %let nwords=%sysfunc(countw(&list));
    %do i=1 %to &nwords;
    %let varl=%scan(&list, &i);
        proc means data=have max noprint;
            var &varl;
            by famid implicate;
            where indid=1;
            OUTPUT OUT=copy max=max_&varl;  
        run;
        data want;
            set have;
            drop &varl;
        run;
        data want (drop=_TYPE_ _FREQ_);
            merge want copy;
            by famid implicate;
            rename max_&varl=&varl;
        run;
    %end;
%mend;
%copyv1(&implist_1);

This works well for one or two variables. However, it is tremendously slow once you do it for 400 variables in a 1.5 GB data set.

I'm pretty sure there is a faster way to do this with some form of proc sql or first.var etc., but I'm relatively new to SAS and so far I couldn't come up with a better solution.

Thank you very much for your support.

Best regards

This is fairly straightforward with a bit of SQL:

proc sql;
create table want as 
  select a.famid, a.indid, a.implicate, b.* from 
  have a 
  left join (
    select * from have 
    group by famid 
    having indid = min(indid)
  ) b 
  on
        a.famid = b.famid 
    and a.implicate = b.implicate
  order by a.famid, a.indid, a.implicate
  ;
quit;

The idea is to join the table to a subset of itself containing only the rows corresponding to the first individual within each family.
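
To see what the subquery aliased as b contributes, it can be run on its own. This is a minimal sketch, assuming the have data set from the question; the table name first_ind is hypothetical and used only for illustration:

```sas
proc sql;
  /* first_ind is a hypothetical name, not part of the original answer */
  create table first_ind as
    select *
    from have
    group by famid
    having indid = min(indid)   /* keep only rows of the lowest-numbered individual per family */
  ;
quit;
```

With the sample data this should leave ten rows, five implicates for each of the two families. SAS typically writes a note about remerging summary statistics for a query like this, because a non-grouped column is compared against an aggregate.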

It is set up to pick the lowest numbered individual within each family, so it will work even if there is no row with indid = 1. If you are sure that there will always be such a row, you can use a slightly simpler query:

proc sql;
create table want as 
  select a.famid, a.indid, a.implicate, b.* from 
  have(sortedby = famid) a 
  left join have(where = (indid = 1)) b 
  on
        a.famid = b.famid 
    and a.implicate = b.implicate
  order by a.famid, a.indid, a.implicate
  ;
quit;

Specifying sortedby = famid provides a hint to the query optimiser that it can skip one of the initial sorts required for the join, which may improve performance a bit.

Yes, this can be done in DATA step using a first. reference made available via the by statement.

data want;
  set have (keep=famid indid implicate imp_inc /* other vars */);

  by famid indid implicate; /* implicate is included so the step logs an error at run time if the data are not sorted */

  if first.famid then if indid ne 1 then abort;

  array across imp_inc           /* other vars */;
  array hold [1,5] _temporary_;  /* or [<n>,5] where <n> means the number of variables in the across array */

  if indid = 1 then do;          /* hold data for 1st individuals implicate across data */
    do _n_ = 1 to dim(across);
      hold[_n_,implicate] = across[_n_];  /* store info of each implicate of first individual */
    end;
  end;
  else do;
    do _n_ = 1 to dim(across);
      across[_n_] = hold[_n_,implicate];  /* apply 1st persons info to subsequent persons */
    end;
  end;
run;

The DATA step could be significantly faster because it makes a single pass through the data. However, there is an internal processing cost associated with calculating all those pesky [] array addresses at run time, and that cost could become noticeable at some <n>.
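
As an illustration of the [<n>,5] sizing, here is a rough sketch with two variables and, to keep the sample short, only two implicates. The data set have2, the output want2, the variable imp_wealth and the index variable _i are all hypothetical and made up for this example:

```sas
/* have2 and imp_wealth are hypothetical, for illustration only */
data have2;
  input famid indid implicate imp_inc imp_wealth;
  cards;
1 1 1 40000 100
1 1 2 25000 200
1 2 1 1000 .
1 2 2 2000 .
;
run;

data want2;
  set have2;
  by famid indid implicate;             /* errors at run time if have2 is not sorted */

  array across imp_inc imp_wealth;      /* the variables to copy */
  array hold [2,2] _temporary_;         /* [number of variables, number of implicates] */

  if indid = 1 then do _i = 1 to dim(across);
    hold[_i, implicate] = across[_i];   /* remember the first individual's values */
  end;
  else do _i = 1 to dim(across);
    across[_i] = hold[_i, implicate];   /* overwrite the values of later individuals */
  end;

  drop _i;
run;
```

Elements of a _temporary_ array keep their values from one observation to the next, which is what lets the first individual's values carry over to the rows that follow.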

SQL has simpler syntax, is easier to understand, and works even if the have data set is unsorted or has some peculiar sequencing within the by group.
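
If the DATA step route is preferred and the sort order of have is not guaranteed, sorting first satisfies the by statement. This is a minimal sketch, assuming it is acceptable to write a sorted copy; have_sorted is a hypothetical name:

```sas
/* have_sorted is a hypothetical output name */
proc sort data=have out=have_sorted;
  by famid indid implicate;   /* same order the DATA step's by statement expects */
run;
```

The DATA step would then read have_sorted instead of have. On a large data set the sort is an extra pass over the data, so it offsets part of the single-pass advantage.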
