Keep rows in a DataFrame that, for all combinations of the values of certain columns, contain the same elements in another column

Question

import pandas as pd

df = pd.DataFrame({'a': ['x','x','x','x','x','y','y','y','y','y'],
                   'b': ['z','z','z','w','w','z','z','w','w','w'],
                   'c': ['c1','c2','c3','c1','c3','c1','c3','c1','c2','c3'],
                   'd': range(1, 11)})

   a  b   c   d
0  x  z  c1   1
1  x  z  c2   2
2  x  z  c3   3
3  x  w  c1   4
4  x  w  c3   5
5  y  z  c1   6
6  y  z  c3   7
7  y  w  c1   8
8  y  w  c2   9
9  y  w  c3  10

How can I keep only the rows that, for all combinations of a and b, contain the same values in c? In other words, how can I exclude rows whose c values are present in only some combinations of a and b?

For example, only c1 and c3 are present in all combinations of a and b ([x,z], [x,w], [y,z], [y,w]), so the output would be:

   a  b   c   d
0  x  z  c1   1
2  x  z  c3   3
3  x  w  c1   4
4  x  w  c3   5
5  y  z  c1   6
6  y  z  c3   7
7  y  w  c1   8
9  y  w  c3  10

Answer 1

Here is one way. Get the unique values of c per group, then find the elements common to all of the returned arrays using reduce and np.intersect1d. Then filter the DataFrame using Series.isin and boolean indexing:

from functools import reduce
import numpy as np

out = df[df['c'].isin(reduce(np.intersect1d, df.groupby(['a','b'])['c'].unique()))]

Breakdown:

s = df.groupby(['a','b'])['c'].unique()
common_elements = reduce(np.intersect1d, s)
# Returns array(['c1', 'c3'], dtype=object)

out = df[df['c'].isin(common_elements)]  # optionally append .copy()

   a  b   c   d
0  x  z  c1   1
2  x  z  c3   3
3  x  w  c1   4
4  x  w  c3   5
5  y  z  c1   6
6  y  z  c3   7
7  y  w  c1   8
9  y  w  c3  10
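
The same intersection idea can also be written with plain Python sets instead of reduce and np.intersect1d. This is only a sketch of an equivalent formulation, not part of the original answer; the names sets_per_group and common are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'a': ['x','x','x','x','x','y','y','y','y','y'],
                   'b': ['z','z','z','w','w','z','z','w','w','w'],
                   'c': ['c1','c2','c3','c1','c3','c1','c3','c1','c2','c3'],
                   'd': range(1, 11)})

# One set of c values per (a, b) group (hypothetical helper name).
sets_per_group = df.groupby(['a', 'b'])['c'].apply(set)

# Values of c that appear in every group.
common = set.intersection(*sets_per_group)

# Keep only the rows whose c value is in the common set.
out = df[df['c'].isin(common)]
```

Unlike np.intersect1d, the set version does not sort the common values, which does not matter here because they are only passed to isin.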

Answer 2

Let's try groupby with nunique to count the unique combinations of a and b for each value of c, and keep the values of c whose count equals the total number of combinations:

s = df['a'] + ',' + df['b'] # combination of a, b
m = s.groupby(df['c']).transform('nunique').eq(s.nunique())

df[m]

   a  b   c   d
0  x  z  c1   1
2  x  z  c3   3
3  x  w  c1   4
4  x  w  c3   5
5  y  z  c1   6
6  y  z  c3   7
7  y  w  c1   8
9  y  w  c3  10
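
Joining a and b with a comma assumes both columns hold strings and that the separator cannot make two different pairs look alike. A possible variation, sketched here as an assumption rather than as the answerer's method, labels each (a, b) combination with an integer via groupby(...).ngroup() and then applies the same nunique comparison.

```python
import pandas as pd

df = pd.DataFrame({'a': ['x','x','x','x','x','y','y','y','y','y'],
                   'b': ['z','z','z','w','w','z','z','w','w','w'],
                   'c': ['c1','c2','c3','c1','c3','c1','c3','c1','c2','c3'],
                   'd': range(1, 11)})

# Integer label for each distinct (a, b) combination, aligned with df's index.
key = df.groupby(['a', 'b']).ngroup()

# For each row: how many distinct combinations does its c value occur in?
# Keep the row if that equals the total number of combinations.
m = key.groupby(df['c']).transform('nunique').eq(key.nunique())

out = df[m]
```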

Answer 3

Try something different, crosstab:

s = pd.crosstab([df['a'],df['b']],df.c).all()
out = df.loc[df.c.isin(s.index[s])]

Output:

   a  b   c   d
0  x  z  c1   1
2  x  z  c3   3
3  x  w  c1   4
4  x  w  c3   5
5  y  z  c1   6
6  y  z  c3   7
7  y  w  c1   8
9  y  w  c3  10
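
To see why .all() works here: crosstab returns a table of counts with one row per (a, b) combination and one column per c value, and a count of 0 is treated as False. The sketch below spells out the steps with an explicit gt(0); the names ct, present and keep are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({'a': ['x','x','x','x','x','y','y','y','y','y'],
                   'b': ['z','z','z','w','w','z','z','w','w','w'],
                   'c': ['c1','c2','c3','c1','c3','c1','c3','c1','c2','c3'],
                   'd': range(1, 11)})

# Counts of each c value within each (a, b) combination; missing pairs count as 0.
ct = pd.crosstab([df['a'], df['b']], df['c'])

# True for the c columns that have a nonzero count in every row.
present = ct.gt(0).all()

# Labels of the c values that occur in every combination.
keep = present[present].index

out = df[df['c'].isin(keep)]
```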

Answer 4

Let's try pivoting the table and then keeping only the columns without NA; an NA means the value is missing for that combination:

all_data =(df.pivot(index=['a','b'], columns='c', values='c')
             .loc[:, lambda x: x.notna().all()]
             .columns)
df[df['c'].isin(all_data)]

Output:

   a  b   c   d
0  x  z  c1   1
2  x  z  c3   3
3  x  w  c1   4
4  x  w  c3   5
5  y  z  c1   6
6  y  z  c3   7
7  y  w  c1   8
9  y  w  c3  10
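
As far as I know, pivot refuses to reshape when the same (a, b, c) combination occurs more than once. If duplicates are possible, a sketch along the same lines can use pivot_table with a count aggregation instead. The duplicated row and the names dup, counts, present and keep are assumptions added for illustration, not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({'a': ['x','x','x','x','x','y','y','y','y','y'],
                   'b': ['z','z','z','w','w','z','z','w','w','w'],
                   'c': ['c1','c2','c3','c1','c3','c1','c3','c1','c2','c3'],
                   'd': range(1, 11)})

# Hypothetical input with a repeated (a, b, c) combination: row 0 appended again.
dup = pd.concat([df, df.iloc[[0]]], ignore_index=True)

# Count rows per (a, b) x c cell; combinations that never occur have no count.
counts = dup.pivot_table(index=['a', 'b'], columns='c', values='d', aggfunc='count')

# Treat missing cells as 0 and keep the c values present in every (a, b) row.
present = counts.fillna(0).gt(0).all()
keep = present[present].index

out = dup[dup['c'].isin(keep)]
```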

Answer 5

We can use groupby + size and then unstack, which will fill NaN for groups of ['a', 'b'] that are missing a 'c' group. Then we dropna and subset the original DataFrame to the c values that survive the dropna.

df[df.c.isin(df.groupby(['a', 'b', 'c']).size().unstack(-1).dropna(axis=1).columns)]

   a  b   c   d
0  x  z  c1   1
2  x  z  c3   3
3  x  w  c1   4
4  x  w  c3   5
5  y  z  c1   6
6  y  z  c3   7
7  y  w  c1   8
9  y  w  c3  10

After dropna, the result of the groupby operation contains columns only for the groups of c that exist in all unique combinations of ['a', 'b'], so we just grab the columns attribute.

df.groupby(['a', 'b', 'c']).size().unstack(-1).dropna(axis=1)

#c     c1   c3
#a b          
#x w  1.0  1.0
#  z  1.0  1.0
#y w  1.0  1.0
#  z  1.0  1.0

Answer 6

You could use a list comprehension with str.contains:

unq = [[x, len(df[(df[['a','b','c']].agg(','.join, axis=1)).str.contains(',' + x)]
                   .drop_duplicates())] for x in df['c'].unique()]
keep = [lst[0] for lst in unq if lst[1] == max([lst[1] for lst in unq])]
df = df[df['c'].isin(keep)]
df

   a  b   c   d
0  x  z  c1   1
2  x  z  c3   3
3  x  w  c1   4
4  x  w  c3   5
5  y  z  c1   6
6  y  z  c3   7
7  y  w  c1   8
9  y  w  c3  10
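
One caveat with the approach above: str.contains does substring matching, so a value such as c1 would also match a hypothetical c10, and comparing against the maximum count assumes that at least one c value covers every combination. A sketch that avoids both by counting distinct (a, b) pairs per c value exactly (the names pairs, n_combos and per_c are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'a': ['x','x','x','x','x','y','y','y','y','y'],
                   'b': ['z','z','z','w','w','z','z','w','w','w'],
                   'c': ['c1','c2','c3','c1','c3','c1','c3','c1','c2','c3'],
                   'd': range(1, 11)})

# Distinct (a, b, c) triples, so repeated rows are not counted twice.
pairs = df[['a', 'b', 'c']].drop_duplicates()

# Total number of distinct (a, b) combinations in the data.
n_combos = len(df[['a', 'b']].drop_duplicates())

# Number of distinct (a, b) combinations in which each c value occurs.
per_c = pairs.groupby('c').size()

# Keep the c values that occur in every combination.
keep = per_c[per_c == n_combos].index
out = df[df['c'].isin(keep)]
```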

Answer 7

If you make the assumptions below, this tells you which elements of column c to keep:

df.groupby("c")["a"].count() == df.groupby("c")["a"].count().max()

Output:

c
c1     True
c2    False
c3     True
Name: a, dtype: bool

Assumptions:

There are no duplicate rows.
At least one value of column c occurs with every combination of a and b.
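
The boolean Series above only says which c values to keep. One way to turn it into a row filter, sketched under the same two assumptions (the names counts, mask and out are illustrative), is to map each row's c value to its flag:

```python
import pandas as pd

df = pd.DataFrame({'a': ['x','x','x','x','x','y','y','y','y','y'],
                   'b': ['z','z','z','w','w','z','z','w','w','w'],
                   'c': ['c1','c2','c3','c1','c3','c1','c3','c1','c2','c3'],
                   'd': range(1, 11)})

# Rows per c value; with no duplicates this is the number of (a, b) combinations.
counts = df.groupby('c')['a'].count()

# True for the c values that reach the maximum count.
mask = counts == counts.max()

# Look up each row's c value in the mask and keep the rows flagged True.
out = df[df['c'].map(mask)]
```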

Answer 8

You can use value_counts to get all combinations of a and b:

vc = df[['a', 'b']].drop_duplicates().value_counts()

Because of drop_duplicates, vc holds a count of 1 for each distinct combination of a and b.

Then you can compare the counts for each group with vc and filter out the groups with missing combinations:

df.groupby('c').filter(lambda x: x[['a', 'b']].value_counts().ge(vc).all())

Output:

   a  b   c   d
0  x  z  c1   1
2  x  z  c3   3
3  x  w  c1   4
4  x  w  c3   5
5  y  z  c1   6
6  y  z  c3   7
7  y  w  c1   8
9  y  w  c3  10

Answer 9

Assuming there are 4 distinct combinations of a and b, as in the example:

A simple solution can be:

df[df['a'].groupby(df['c']).transform('count').eq(4)]
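
The hard-coded 4 can be replaced by the number of (a, b) groups actually present in the data. This is a sketch only; it still assumes there are no duplicate rows, and n_combos is a name introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({'a': ['x','x','x','x','x','y','y','y','y','y'],
                   'b': ['z','z','z','w','w','z','z','w','w','w'],
                   'c': ['c1','c2','c3','c1','c3','c1','c3','c1','c2','c3'],
                   'd': range(1, 11)})

# Number of distinct (a, b) combinations, computed instead of hard-coded.
n_combos = df.groupby(['a', 'b']).ngroups

# Keep rows whose c value occurs once per combination (no duplicate rows assumed).
out = df[df['a'].groupby(df['c']).transform('count').eq(n_combos)]
```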

9 answers

Answer 1: score 10, accepted, 2021-01-13 18:25:01
Answer 2: score 10, 2021-01-13 18:30:21
Answer 3: score 9, 2021-01-13 19:09:48
Answer 4: score 8, 2021-01-13 18:30:17
Answer 5: score 7, 2021-01-13 18:49:25
Answer 6: score 3, 2021-01-13 18:41:36
Answer 7: score 1, 2021-01-13 18:21:15
Answer 8: score 1, 2021-01-19 20:54:22
Answer 9: score -1, 2021-01-13 19:11:31
