
SAS to delete observations that meet condition within group

I want to delete records in the Have dataset that meet all of the following conditions. ID_num here stands for the 3-digit part of the ID field.

  • ID = Mxxx
  • Type = blood
  • located before any of the following records within each ( ID_num , Drug ) group:
    • ID=Mxxx and Type=milk
    • ID=Infxxx

Below are Have and the desired output.

data have;
     input ID $ Type $ Drug $;
     cards;
M001    blood A
M001    blood A
M001    blood A
M001    blood B
M001    blood B
M001    milk  B
M001    blood C
M001    blood C
M002    blood A
M002    blood A
Inf002  blood A
M002    blood A
M002    blood B
M002    milk  C
Inf003  blood B
M003    blood B
;
run;
data want;
     input ID $ Type $ Drug $;
     cards;
M001    milk   B
Inf002  blood  A
M002    blood  A
M002    milk   C
Inf003  blood  B
M003    blood  B
;
run;

For example, the M002 (blood, A) observation that follows the Inf002 (blood, A) observation stays because it occurs after an infant sample in the same drug group. But the two M002 (blood, A) observations above it should be deleted, as they occur before the first infant sample in the same drug group. Likewise, the two M001 (blood, C) observations following M001 (milk, B) should be deleted, because they belong to a different drug group.

Edit: group by ( gp , Drug ).

Keys

  1. Extract the ID grouping number ( gp in the code) with a SAS regular-expression function ( prxmatch(pattern, source) here), which returns the position of the first match.

  2. The keep condition can be examined row by row while the data is grouped by ( gp , Drug ). The start of each ( gp , Drug ) group is identified by FIRST.drug , which is also set whenever gp changes.

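    • The dataset must be sorted before the BY statement can be used. PROC SORT keeps the relative order of observations with equal BY values by default (the EQUALS option), so the original ordering within each group is preserved.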
    • The original ordering can be tracked by recording _n_ in the regex parsing phase.
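
The extraction in key 1 can be tried on two literal IDs. This is a minimal sketch rather than part of the solution: the IDs are copied from the sample data, the pattern is written with explicit / delimiters, and it assumes the group number is always the first run of three digits in ID.

```sas
data _null_;
    length ID $ 8;
    do ID = 'M001', 'Inf002';
        pos = prxmatch('/\d{3}/', ID);  /* start position of the first run of three digits */
        gp  = substr(ID, pos, 3);       /* SUBSTR(string, start, length) */
        put ID= pos= gp=;               /* write the values to the log */
    end;
run;
```

The log should show pos=2 with gp=001 for M001 and pos=4 with gp=002 for Inf002, so mother and infant records of the same family end up with the same gp value.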

Code

* "have" is in your post;
data tmp;
    set have;
    pos = prxmatch('/\d{3}/', ID);  * position of the 3-digit part of ID;
    gp = substr(ID, pos, 3);  * group number (third argument of SUBSTR is a length);
    mi = substr(ID, 1, 1);  * mother or infant;
    n = _n_; * keep track of the original ordering;
    drop pos;
run;

proc sort data=tmp out=tmp;
    by gp drug;
run;

data want(drop=flag_keep gp mi);
    set tmp;
    by gp drug;
    * state variables;
    retain flag_keep 0;
    if FIRST.drug then flag_keep = 0;
    * mark keep;
    if (flag_keep = 1) or (mi = "I") or ((mi = "M") and (Type = "milk"))
        then flag_keep = 1;
    if flag_keep = 1 then output;
run;

proc sort data=want out=want;
    by n;
run;

Result: the original row number n is shown for clarity.

Obs  ID      Type   Drug  n
1    M001    milk   B     6
2    Inf002  blood  A     11
3    M002    blood  A     12
4    M002    milk   C     14
5    Inf003  blood  B     15
6    M003    blood  B     16
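
The reset of flag_keep depends on how FIRST.drug behaves under BY gp drug. The sketch below may help to see this. It uses a small made-up dataset, demo, that holds only the two grouping columns (the values are illustrative, not taken from Have), and it copies the automatic FIRST. variables into ordinary variables, first_gp and first_drug, so they can be written to the log.

```sas
/* Hypothetical input: only the grouping columns, in arbitrary order */
data demo;
    input gp $ drug $;
    cards;
002 A
001 A
002 B
001 A
002 A
;
run;

proc sort data=demo;
    by gp drug;
run;

data _null_;
    set demo;
    by gp drug;
    first_gp   = first.gp;    /* 1 on the first row of each gp */
    first_drug = first.drug;  /* 1 on the first row of each (gp, drug) pair */
    put gp= drug= first_gp= first_drug=;
run;
```

After sorting, the rows are 001 A, 001 A, 002 A, 002 A, 002 B. first_drug is 1 on the first 001 A row, on the first 002 A row (gp changed although drug did not), and on the 002 B row (drug changed within the same gp), which are exactly the points where flag_keep has to start again from 0.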
