df = pd.DataFrame({'a':['x','x','x','x','x','y','y','y','y','y'],'b':['z','z','z','w','w','z','z','w','w','w'],'c':['c1','c2','c3','c1','c3','c1','c3','c1','c2','c3'],'d':range(1,11)})
a b c d
0 x z c1 1
1 x z c2 2
2 x z c3 3
3 x w c1 4
4 x w c3 5
5 y z c1 6
6 y z c3 7
7 y w c1 8
8 y w c2 9
9 y w c3 10
how can I keep only the rows that, for all combinations of a
and b
, contain the same values in c
? Or in other words, how to exclude rows with c
values that are only present in some combinations of a
and b
?
For example, only c1
and c3
are present in all combinations of a
and b
( [x,z]
, [x,w]
, [y,z]
, [y,w]
), so the output would be
a b c d
0 x z c1 1
2 x z c3 3
3 x w c1 4
4 x w c3 5
5 y z c1 6
6 y z c3 7
7 y w c1 8
9 y w c3 10
Here is one way. Get unique lists per group and then check common elements across all the returned arrays using reduce
and np.intersect1d
. Then filter the dataframe using series.isin
and boolean indexing
from functools import reduce
out = df[df['c'].isin(reduce(np.intersect1d,df.groupby(['a','b'])['c'].unique()))]
Breakdown:
s = df.groupby(['a','b'])['c'].unique()
common_elements = reduce(np.intersect1d,s)
#Returns :-> array(['c1', 'c3'], dtype=object)
out = df[df['c'].isin(common_elements )]#.copy()
a b c d
0 x z c1 1
2 x z c3 3
3 x w c1 4
4 x w c3 5
5 y z c1 6
6 y z c3 7
7 y w c1 8
9 y w c3 10
Lets try groupby
with nunique
to count of unique elements per column c
group:
s = df['a'] + ',' + df['b'] # combination of a, b
m = s.groupby(df['c']).transform('nunique').eq(s.nunique())
df[m]
a b c d
0 x z c1 1
2 x z c3 3
3 x w c1 4
4 x w c3 5
5 y z c1 6
6 y z c3 7
7 y w c1 8
9 y w c3 10
Try something diff crosstab
s = pd.crosstab([df['a'],df['b']],df.c).all()
out = df.loc[df.c.isin(s.index[s])]
Out[34]:
a b c d
0 x z c1 1
2 x z c3 3
3 x w c1 4
4 x w c3 5
5 y z c1 6
6 y z c3 7
7 y w c1 8
9 y w c3 10
Let's try pivot the table, then drop NA
, which means a value is missing in the combination:
all_data =(df.pivot(index=['a','b'], columns='c', values='c')
.loc[:, lambda x: x.notna().all()]
.columns)
df[df['c'].isin(all_data)]
Output:
a b c d
0 x z c1 1
2 x z c3 3
3 x w c1 4
4 x w c3 5
5 y z c1 6
6 y z c3 7
7 y w c1 8
9 y w c3 10
We can use groupby
+ size
and then unstack
, which will fill NaN
for groups of ['a', 'b'] that are missing a 'c' group. Then we dropna
and subset the original DataFrame to the c
values that survive the dropna.
df[df.c.isin(df.groupby(['a', 'b', 'c']).size().unstack(-1).dropna(axis=1).columns)]
a b c d
0 x z c1 1
2 x z c3 3
3 x w c1 4
4 x w c3 5
5 y z c1 6
6 y z c3 7
7 y w c1 8
9 y w c3 10
The result of the groupby operation contains columns only for groups of c
that exist in all unique combinations of ['a', 'b']
, so we just grab the columns attribute.
df.groupby(['a', 'b', 'c']).size().unstack(-1).dropna(axis=1)
#c c1 c3
#a b
#x w 1.0 1.0
# z 1.0 1.0
#y w 1.0 1.0
# z 1.0 1.0
You could use list comprehension with str.contains
:
unq = [[x, len(df[(df[['a','b','c']].agg(','.join, axis=1)).str.contains(',' + x)]
.drop_duplicates())] for x in df['c'].unique()]
keep = [lst[0] for lst in unq if lst[1] == max([lst[1] for lst in unq])]
df = df[df['c'].isin(keep)]
df
a b c d
0 x z c1 1
2 x z c3 3
3 x w c1 4
4 x w c3 5
5 y z c1 6
6 y z c3 7
7 y w c1 8
9 y w c3 10
If you make the below assumptions this works to give you which elements of column c to keep:
df.groupby("c")["a"].count() == df.groupby("c")["a"].count().max()
Output:
c
c1 True
c2 False
c3 True
Name: a, dtype: bool
Assumptions:
You can use value_counts
and get all combinations of a
and b
:
vc = df[['a', 'b']].drop_duplicates().value_counts()
Result:
a b
y z 1
w 1
x z 1
w 1
Then you can compare counts for each group with vc
and filter out groups with missing combinations:
df.groupby('c').filter(lambda x: x[['a', 'b']].value_counts().ge(vc).all())
Output:
a b c d
0 x z c1 1
2 x z c3 3
3 x w c1 4
4 x w c3 5
5 y z c1 6
6 y z c3 7
7 y w c1 8
9 y w c3 10
Assuming there are 4 distinct values as per the example:
A simple solution can be:
df[df['a'].groupby(df['c']).transform('count').eq(4)]
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.