Delete group by two variables on condition in SAS / SQL

Question

I'm looking for a solution to the following problem. I'm using SAS, so either a basic SQL or a DATA step approach is welcome. Maybe the solution is simple, but I'm fairly new to SAS and can't find one.

I have a dataset and want to remove a subgroup at the second level based on a condition. To make it easier, let me explain with an example. The condition is: when any value in ColC is 1, remove that subgroup within its main group. The main group is ColA and the subgroup is ColB.

ColA | ColB | ColC
  1  |  a   |  0  
  1  |  a   |  1  
  1  |  b   |  0  
  1  |  b   |  0  
  2  |  a   |  0  
  2  |  a   |  0  
  2  |  b   |  0  
  2  |  b   |  0  
  3  |  a   |  0  
  3  |  a   |  0  
  3  |  b   |  1  
  3  |  b   |  0

Expected output:

ColA | ColB | ColC
  1  |  b   |  0  
  1  |  b   |  0  
  2  |  a   |  0  
  2  |  a   |  0  
  2  |  b   |  0  
  2  |  b   |  0  
  3  |  a   |  0  
  3  |  a   |  0

I tried approaches like:

select * from data
group by ColA, ColB having ColC <> 1

I thought this would group by the two columns and select all groups without ColC = 1, but it only "removes" the rows with ColC = 1.

Another approach is something like this:

select * from data
where ColA in (select ColA from data where ColC <> 1)

But of course, I can't reach the subgroups with this. I was also thinking about a join, but I'm not sure how to do it.

Answer 1

You can use not exists with a correlated subquery:

select d.*
from data d
where not exists (select 1
                  from data d2 
                  where d2.cola = d.cola and d2.colb = d.colb and d2.colc = 1
                 );

This keeps all combinations of cola / colb that do not have a 1 in colc.

This can also be adapted to a delete, but you seem to want a filtered result set.
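A minimal sketch of that delete variant, assuming the table is called have (as in the other answers) and that changing it in place is acceptable. The helper table bad_keys is a made-up name; collecting the offending ColA / ColB pairs first means the delete does not have to query the table it is deleting from.

```sas
proc sql;
    /* bad_keys (hypothetical name): subgroups with at least one ColC = 1 */
    create table bad_keys as
        select distinct ColA, ColB
        from have
        where ColC = 1;

    /* remove every row of have that belongs to one of those subgroups */
    delete from have
    where exists (select 1
                  from bad_keys as k
                  where k.ColA = have.ColA
                    and k.ColB = have.ColB);
quit;
```

Unlike the select above, this modifies have itself, so it may be worth running it on a copy first.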

Answer 2

The having clause in SQL allows you to filter a query by a summary function. The query below includes only the groups where the sum of ColC is 0 after grouping by ColA and ColB (this assumes ColC holds only 0 and 1).

proc sql noprint;
    create table want as 
        select *
        from have
        group by ColA, ColB
        having sum(ColC) = 0
    ;
quit;
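If ColC can hold values other than 0 and 1 (an assumption; the sample data only contains 0 and 1), sum(ColC) = 0 could in principle match a group that contains a 1, for example the values 1 and -1. A hedged variant tests the condition directly:

```sas
proc sql noprint;
    create table want as
        select *
        from have
        group by ColA, ColB
        /* (ColC = 1) evaluates to 1 or 0 per row, so the max is 1
           as soon as any row in the subgroup has ColC = 1 */
        having max(ColC = 1) = 0
    ;
quit;
```

In both forms, select * together with group by works because PROC SQL remerges the summary value onto the detail rows and writes a note about remerging to the log. Other SQL dialects generally either reject this form or return one row per group.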

Answer 3

Here is a DATA step approach using a double DoW loop:

data have;
infile datalines dlm='|';
input ColA ColB $ ColC;
datalines;
  1  |  a   |  0  
  1  |  a   |  1  
  1  |  b   |  0  
  1  |  b   |  0  
  2  |  a   |  0  
  2  |  a   |  0  
  2  |  b   |  0  
  2  |  b   |  0  
  3  |  a   |  0  
  3  |  a   |  0  
  3  |  b   |  1  
  3  |  b   |  0  
;

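data want (drop=c);
    c = 1;   /* flag: 1 = keep this subgroup */
    /* First pass: read one ColA/ColB group (have must be sorted by ColA ColB)
       and clear the flag if any row has ColC = 1. */
    do _n_ = 1 by 1 until (last.ColB);
        set have;
        by ColA ColB;
        if ColC = 1 then c = 0;
    end;
    /* Second pass: re-read the same _n_ rows and output them only if the flag survived. */
    do _n_ = 1 to _n_;
        set have;
        if c then output;
    end;
run;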

Answer 4

A simple way to do it with common code:

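proc sort data=have;
   by cola colb;
run;

data want;
   /* in1 = 1 for every BY group that has at least one row with colc=1 */
   merge have (in=in1 where=(colc=1))
         have (in=in2)
         ;
   by cola colb;
   if ^in1;   /* keep only the groups with no colc=1 row */
run;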

The first HAVE selects only the records with COLC=1. Because we are merging by COLA and COLB, IN1 is set for every record in a BY group that has such a record, so the subsetting IF removes all records with the same COLA and COLB, which is the goal.

Answer 5

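Also, a hash object approach:

data want;
    if _n_ = 1 then do;
        /* load the ColA/ColB keys of every row with ColC = 1, once */
        declare hash h (dataset : 'have(where=(ColC=1))');
        h.definekey ('ColA', 'ColB');
        h.definedone();
    end;
    set have;
    /* check() returns 0 when the key is found, so a nonzero result keeps the row */
    if h.check();
run;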

5 answers

solution1
3 2020-06-07 21:48:13

solution2
3 ACCEPTED 2020-06-07 23:12:49

solution3
2 2020-06-08 09:17:38

solution4
2 2020-06-08 12:49:46

solution5
1 2020-06-08 09:19:59
