Delete group by two variables on condition in SAS / SQL

I'm looking for a solution to the following problem. I'm using SAS, so either a basic SQL or a DATA step approach is welcome. The solution may be simple, but I'm fairly new to SAS and can't find one.

I have a dataset and want to remove subgroups, at the second grouping level, based on a condition. An example makes this easier to explain. The condition is: when any value of ColC within a subgroup is 1, remove that whole subgroup from its main group. The main group is ColA and the subgroup is ColB.

ColA | ColB | ColC
  1  |  a   |  0  
  1  |  a   |  1  
  1  |  b   |  0  
  1  |  b   |  0  
  2  |  a   |  0  
  2  |  a   |  0  
  2  |  b   |  0  
  2  |  b   |  0  
  3  |  a   |  0  
  3  |  a   |  0  
  3  |  b   |  1  
  3  |  b   |  0  

Expected output:

ColA | ColB | ColC
  1  |  b   |  0  
  1  |  b   |  0  
  2  |  a   |  0  
  2  |  a   |  0  
  2  |  b   |  0  
  2  |  b   |  0  
  3  |  a   |  0  
  3  |  a   |  0  

I tried approaches like:

select * from data
group by ColA, ColB having ColC <> 1

I thought this would group by the two columns and select all groups that contain no ColC = 1. But it only "removes" the individual rows with ColC = 1.

Another approach is something like this:

select * from data
where ColA in (select ColA from data where ColC <> 1)

But of course this can't reach the subgroups. I was also thinking about a join, but I'm not sure how to do it.

You can use not exists with a correlated subquery:

select d.*
from data d
where not exists (select 1
                  from data d2 
                  where d2.cola = d.cola and d2.colb = d.colb and d2.colc = 1
                 );

This keeps all combinations of cola / colb that do not have a 1 in colc.

This can also be adapted to a delete, but you seem to want a filtered result set.
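
As a rough sketch of the delete variant mentioned above: PROC SQL may warn when a DELETE has a subquery that reads the table being modified, so this version first stages the flagged ColA / ColB pairs in a separate table. The names have_copy and flagged are hypothetical, and the code works on a copy so that have stays intact.

```sas
/* Sketch only: have_copy and flagged are hypothetical names. */
data have_copy;
    set have;
run;

proc sql;
    /* Collect each ColA/ColB pair that contains at least one ColC = 1 */
    create table flagged as
        select distinct ColA, ColB
        from have_copy
        where ColC = 1;

    /* Delete every row whose pair appears in the lookup table */
    delete from have_copy as d
        where exists (select 1
                      from flagged as f
                      where f.ColA = d.ColA and f.ColB = d.ColB);
quit;
```

Staging the keys means the subquery never reads rows that the same statement is deleting.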

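The having clause in SQL lets you filter a query by a summary function. The query below keeps only the groups where the sum of ColC is 0 after grouping by ColA and ColB. PROC SQL remerges the group sum back onto the detail rows, so select * still returns every row of the surviving groups.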

proc sql noprint;
    create table want as 
        select *
        from have
        group by ColA, ColB
        having sum(ColC) = 0
    ;
quit;
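
Testing sum(ColC) = 0 relies on ColC holding only 0 and 1, as it does in the sample data. If ColC could hold other values (for example a -1 that cancels a 1), a possible variant is to count the rows that equal 1 instead. This is a sketch, assuming the same have table; it uses the fact that SAS evaluates a comparison such as ColC = 1 to 1 or 0.

```sas
proc sql;
    create table want as
        select *
        from have
        group by ColA, ColB
        /* (ColC = 1) is 1 or 0 per row, so the sum counts the 1s in the group */
        having sum(ColC = 1) = 0
    ;
quit;
```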

Here is a DATA step approach using a double DoW loop. It assumes have is sorted by ColA and ColB, as the sample data is:

data have;
infile datalines dlm='|';
input ColA ColB $ ColC;
datalines;
  1  |  a   |  0  
  1  |  a   |  1  
  1  |  b   |  0  
  1  |  b   |  0  
  2  |  a   |  0  
  2  |  a   |  0  
  2  |  b   |  0  
  2  |  b   |  0  
  3  |  a   |  0  
  3  |  a   |  0  
  3  |  b   |  1  
  3  |  b   |  0  
;

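data want (drop=c);
    c = 1;                              /* assume the group is clean */
    /* Pass 1: read one ColA/ColB group, counting its rows in _n_ */
    do _n_ = 1 by 1 until (last.ColB);
        set have;
        by ColA ColB;
        if ColC = 1 then c = 0;         /* flag the group for removal */
    end;
    /* Pass 2: re-read the same _n_ rows and output them only if clean */
    do _n_ = 1 to _n_;
        set have;
        if c then output;
    end;
run;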

A simple way to do it with common code:

proc sort data=have;
   by cola colb;
run;

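data want;
   /* First copy: only rows with colc=1; second copy: all rows */
   merge have (in=in1 where=(colc=1))
         have (in=in2)
         ;
   by cola colb;
   /* Keep only BY groups to which the colc=1 copy contributed nothing */
   if not in1;
run;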

The first HAVE selects all records with COLC=1. Because the merge is by COLA and COLB, IN1 is set for every record in a BY group that contains such a record, so the subsetting IF removes all records with that COLA and COLB, which is the goal.

A hash object approach also works:

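data want;
    if _n_ = 1 then do;
        /* Load the ColA/ColB keys of every row with ColC = 1 */
        declare hash h (dataset : 'have(where=(ColC=1))');
        h.definekey ('ColA', 'ColB');
        h.definedone();
    end;
    set have;
    /* check() returns 0 when the key is found, so keep rows with no match */
    if h.check();
run;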
