Keep rows in data frame that, for all combinations of the values of certain columns, contain the same elements in another column

Question

df = pd.DataFrame({'a':['x','x','x','x','x','y','y','y','y','y'],'b':['z','z','z','w','w','z','z','w','w','w'],'c':['c1','c2','c3','c1','c3','c1','c3','c1','c2','c3'],'d':range(1,11)})

   a  b   c   d
0  x  z  c1   1
1  x  z  c2   2
2  x  z  c3   3
3  x  w  c1   4
4  x  w  c3   5
5  y  z  c1   6
6  y  z  c3   7
7  y  w  c1   8
8  y  w  c2   9
9  y  w  c3  10

how can I keep only the rows that, for all combinations of a and b , contain the same values in c ? Or in other words, how to exclude rows with c values that are only present in some combinations of a and b ?

For example, only c1 and c3 are present in all combinations of a and b ( [x,z] , [x,w] , [y,z] , [y,w] ), so the output would be

   a  b   c   d
0  x  z  c1   1
2  x  z  c3   3
3  x  w  c1   4
4  x  w  c3   5
5  y  z  c1   6
6  y  z  c3   7
7  y  w  c1   8
9  y  w  c3  10

Answer 1

Here is one way. Get unique lists per group and then check common elements across all the returned arrays using reduce and np.intersect1d . Then filter the dataframe using series.isin and boolean indexing

from functools import reduce
out = df[df['c'].isin(reduce(np.intersect1d,df.groupby(['a','b'])['c'].unique()))]

Breakdown:

s = df.groupby(['a','b'])['c'].unique()
common_elements = reduce(np.intersect1d,s)
#Returns :-> array(['c1', 'c3'], dtype=object)

out = df[df['c'].isin(common_elements )]#.copy()

   a  b   c   d
0  x  z  c1   1
2  x  z  c3   3
3  x  w  c1   4
4  x  w  c3   5
5  y  z  c1   6
6  y  z  c3   7
7  y  w  c1   8
9  y  w  c3  10

Answer 2

Lets try groupby with nunique to count of unique elements per column c group:

s = df['a'] + ',' + df['b'] # combination of a, b
m = s.groupby(df['c']).transform('nunique').eq(s.nunique())

df[m]

   a  b   c   d
0  x  z  c1   1
2  x  z  c3   3
3  x  w  c1   4
4  x  w  c3   5
5  y  z  c1   6
6  y  z  c3   7
7  y  w  c1   8
9  y  w  c3  10

Answer 3

Try something diff crosstab

s = pd.crosstab([df['a'],df['b']],df.c).all()
out = df.loc[df.c.isin(s.index[s])]
Out[34]: 
   a  b   c   d
0  x  z  c1   1
2  x  z  c3   3
3  x  w  c1   4
4  x  w  c3   5
5  y  z  c1   6
6  y  z  c3   7
7  y  w  c1   8
9  y  w  c3  10

Answer 4

Let's try pivot the table, then drop NA , which means a value is missing in the combination:

all_data =(df.pivot(index=['a','b'], columns='c', values='c')
             .loc[:, lambda x: x.notna().all()]
             .columns)
df[df['c'].isin(all_data)]

Output:

   a  b   c   d
0  x  z  c1   1
2  x  z  c3   3
3  x  w  c1   4
4  x  w  c3   5
5  y  z  c1   6
6  y  z  c3   7
7  y  w  c1   8
9  y  w  c3  10

Answer 5

We can use groupby + size and then unstack , which will fill NaN for groups of ['a', 'b'] that are missing a 'c' group. Then we dropna and subset the original DataFrame to the c values that survive the dropna.

df[df.c.isin(df.groupby(['a', 'b', 'c']).size().unstack(-1).dropna(axis=1).columns)]

   a  b   c   d
0  x  z  c1   1
2  x  z  c3   3
3  x  w  c1   4
4  x  w  c3   5
5  y  z  c1   6
6  y  z  c3   7
7  y  w  c1   8
9  y  w  c3  10

The result of the groupby operation contains columns only for groups of c that exist in all unique combinations of ['a', 'b'] , so we just grab the columns attribute.

df.groupby(['a', 'b', 'c']).size().unstack(-1).dropna(axis=1)

#c     c1   c3
#a b          
#x w  1.0  1.0
#  z  1.0  1.0
#y w  1.0  1.0
#  z  1.0  1.0

Answer 6

You could use list comprehension with str.contains :

unq = [[x, len(df[(df[['a','b','c']].agg(','.join, axis=1)).str.contains(',' + x)]
                   .drop_duplicates())] for x in df['c'].unique()]
keep = [lst[0] for lst in unq if lst[1] == max([lst[1] for lst in unq])]
df = df[df['c'].isin(keep)]
df

   a  b   c   d
0  x  z  c1   1
2  x  z  c3   3
3  x  w  c1   4
4  x  w  c3   5
5  y  z  c1   6
6  y  z  c3   7
7  y  w  c1   8
9  y  w  c3  10

Answer 7

If you make the below assumptions this works to give you which elements of column c to keep:

df.groupby("c")["a"].count() == df.groupby("c")["a"].count().max()

Output:

c
c1     True
c2    False
c3     True
Name: a, dtype: bool

Assumptions:

There are no duplicates
There is at least one value for column c that contains all combinations of a and b.

Answer 8

You can use value_counts and get all combinations of a and b :

vc = df[['a', 'b']].drop_duplicates().value_counts()

Result:

Then you can compare counts for each group with vc and filter out groups with missing combinations:

df.groupby('c').filter(lambda x: x[['a', 'b']].value_counts().ge(vc).all())

Output:

   a  b   c   d
0  x  z  c1   1
2  x  z  c3   3
3  x  w  c1   4
4  x  w  c3   5
5  y  z  c1   6
6  y  z  c3   7
7  y  w  c1   8
9  y  w  c3  10

Answer 9

Assuming there are 4 distinct values as per the example:

A simple solution can be:

df[df['a'].groupby(df['c']).transform('count').eq(4)]

Keep rows in data frame that, for all combinations of the values of certain columns, contain the same elements in another column

Question

9 answers

solution1
10 ACCPTED 2021-01-13 18:25:01

solution2
10 2021-01-13 18:30:21

solution3
9 2021-01-13 19:09:48

solution4
8 2021-01-13 18:30:17

solution5
7 2021-01-13 18:49:25

solution6
3 2021-01-13 18:41:36

solution7
1 2021-01-13 18:21:15

solution8
1 2021-01-19 20:54:22

solution9
-1 2021-01-13 19:11:31

Keep rows in data frame that, for all combinations of the values of certain columns, contain the same elements in another column

Question

9 answers

solution1 10 ACCPTED 2021-01-13 18:25:01

solution2 10 2021-01-13 18:30:21

solution3 9 2021-01-13 19:09:48

solution4 8 2021-01-13 18:30:17

solution5 7 2021-01-13 18:49:25

solution6 3 2021-01-13 18:41:36

solution7 1 2021-01-13 18:21:15

solution8 1 2021-01-19 20:54:22

solution9 -1 2021-01-13 19:11:31

solution1
10 ACCPTED 2021-01-13 18:25:01

solution2
10 2021-01-13 18:30:21

solution3
9 2021-01-13 19:09:48

solution4
8 2021-01-13 18:30:17

solution5
7 2021-01-13 18:49:25

solution6
3 2021-01-13 18:41:36

solution7
1 2021-01-13 18:21:15

solution8
1 2021-01-19 20:54:22

solution9
-1 2021-01-13 19:11:31