Keep the rows of a DataFrame whose value in one column appears in every combination of values from other columns

Question

import pandas as pd
df = pd.DataFrame({'a':['x','x','x','x','x','y','y','y','y','y'],'b':['z','z','z','w','w','z','z','w','w','w'],'c':['c1','c2','c3','c1','c3','c1','c3','c1','c2','c3'],'d':range(1,11)})

   a  b   c   d
0  x  z  c1   1
1  x  z  c2   2
2  x  z  c3   3
3  x  w  c1   4
4  x  w  c3   5
5  y  z  c1   6
6  y  z  c3   7
7  y  w  c1   8
8  y  w  c2   9
9  y  w  c3  10

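How can I keep only the rows whose value in c appears in every combination of a and b? In other words, how do I exclude the rows whose c value exists in only some of the combinations of a and b?

For example, only c1 and c3 appear in all combinations of a and b ([x,z], [x,w], [y,z], [y,w]), so the output would be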

   a  b   c   d
0  x  z  c1   1
2  x  z  c3   3
3  x  w  c1   4
4  x  w  c3   5
5  y  z  c1   6
6  y  z  c3   7
7  y  w  c1   8
9  y  w  c3  10

Answer 1

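Here is one way. Get the unique values of c for each group, then use reduce with np.intersect1d to find the elements common to all of the returned arrays. Then filter the DataFrame using Series.isin and boolean indexing.

import numpy as np
from functools import reduce
out = df[df['c'].isin(reduce(np.intersect1d,df.groupby(['a','b'])['c'].unique()))]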

Breakdown:

s = df.groupby(['a','b'])['c'].unique()
common_elements = reduce(np.intersect1d,s)
# Returns: array(['c1', 'c3'], dtype=object)

out = df[df['c'].isin(common_elements)]  # append .copy() if you plan to modify out

   a  b   c   d
0  x  z  c1   1
2  x  z  c3   3
3  x  w  c1   4
4  x  w  c3   5
5  y  z  c1   6
6  y  z  c3   7
7  y  w  c1   8
9  y  w  c3  10
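The same idea can also be sketched with plain Python sets instead of np.intersect1d. This is only an illustrative variant, not part of the original answer; the names sets_per_group and common are made up for the example.

```python
import pandas as pd

df = pd.DataFrame({
    'a': ['x', 'x', 'x', 'x', 'x', 'y', 'y', 'y', 'y', 'y'],
    'b': ['z', 'z', 'z', 'w', 'w', 'z', 'z', 'w', 'w', 'w'],
    'c': ['c1', 'c2', 'c3', 'c1', 'c3', 'c1', 'c3', 'c1', 'c2', 'c3'],
    'd': range(1, 11),
})

# One set of c values per (a, b) group (hypothetical helper name).
sets_per_group = df.groupby(['a', 'b'])['c'].apply(set)

# Values of c that are present in every group.
common = set.intersection(*sets_per_group)

# Keep only the rows whose c value is in the common set.
out = df[df['c'].isin(common)]
```

Unlike np.intersect1d, the set intersection does not sort its result, which does not matter here because the values are only passed to isin.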

Answer 2

Let's try groupby and nunique to count, for each value of c, the number of unique (a, b) combinations it appears in:

s = df['a'] + ',' + df['b'] # combination of a, b
m = s.groupby(df['c']).transform('nunique').eq(s.nunique())

df[m]

   a  b   c   d
0  x  z  c1   1
2  x  z  c3   3
3  x  w  c1   4
4  x  w  c3   5
5  y  z  c1   6
6  y  z  c3   7
7  y  w  c1   8
9  y  w  c3  10
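If a or b are not strings, or could themselves contain a comma, the concatenated key may not work as intended. A possible variant, sketched here under that assumption, labels each (a, b) combination with groupby(...).ngroup() instead of building a string; the names combo and mask are chosen only for this example.

```python
import pandas as pd

df = pd.DataFrame({
    'a': ['x', 'x', 'x', 'x', 'x', 'y', 'y', 'y', 'y', 'y'],
    'b': ['z', 'z', 'z', 'w', 'w', 'z', 'z', 'w', 'w', 'w'],
    'c': ['c1', 'c2', 'c3', 'c1', 'c3', 'c1', 'c3', 'c1', 'c2', 'c3'],
    'd': range(1, 11),
})

# Integer label for each (a, b) combination, aligned with df's index.
combo = df.groupby(['a', 'b']).ngroup()

# For every row: does its c value occur in as many distinct
# combinations as exist in the whole frame?
mask = combo.groupby(df['c']).transform('nunique').eq(combo.nunique())

out = df[mask]
```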

Answer 3

Trying something different with crosstab:

s = pd.crosstab([df['a'],df['b']],df.c).all()
out = df.loc[df.c.isin(s.index[s])]

Output:
   a  b   c   d
0  x  z  c1   1
2  x  z  c3   3
3  x  w  c1   4
4  x  w  c3   5
5  y  z  c1   6
6  y  z  c3   7
7  y  w  c1   8
9  y  w  c3  10
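To see why .all() does the job, it may help to look at the intermediate table. The sketch below is illustrative only, with variable names picked for the example: crosstab counts the rows per (a, b) combination and c value, and a count of 0 is falsy, so .all() is True only for columns with no zero.

```python
import pandas as pd

df = pd.DataFrame({
    'a': ['x', 'x', 'x', 'x', 'x', 'y', 'y', 'y', 'y', 'y'],
    'b': ['z', 'z', 'z', 'w', 'w', 'z', 'z', 'w', 'w', 'w'],
    'c': ['c1', 'c2', 'c3', 'c1', 'c3', 'c1', 'c3', 'c1', 'c2', 'c3'],
    'd': range(1, 11),
})

# Rows: (a, b) combinations; columns: c values; cells: row counts.
table = pd.crosstab([df['a'], df['b']], df['c'])

# Column-wise all(): True only where every combination has a nonzero count.
present_everywhere = table.all()

# c values to keep.
keep = present_everywhere[present_everywhere].index

out = df[df['c'].isin(keep)]
```

In this data the c2 column holds a 0 for the (x, w) and (y, z) rows, which is what excludes it.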

Answer 4

Let's try a pivot table and then drop the columns containing NA, since an NA means the value is missing from some combination:

all_data =(df.pivot(index=['a','b'], columns='c', values='c')
             .loc[:, lambda x: x.notna().all()]
             .columns)
df[df['c'].isin(all_data)]

Output:

   a  b   c   d
0  x  z  c1   1
2  x  z  c3   3
3  x  w  c1   4
4  x  w  c3   5
5  y  z  c1   6
6  y  z  c3   7
7  y  w  c1   8
9  y  w  c3  10

Answer 5

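We can use groupby + size and then unstack, which fills in NaN for the ['a', 'b'] groups that lack a given 'c' group. We then drop the columns containing NaN and subset the original DataFrame to the c values that survive the dropna.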

df[df.c.isin(df.groupby(['a', 'b', 'c']).size().unstack(-1).dropna(axis=1).columns)]

   a  b   c   d
0  x  z  c1   1
2  x  z  c3   3
3  x  w  c1   4
4  x  w  c3   5
5  y  z  c1   6
6  y  z  c3   7
7  y  w  c1   8
9  y  w  c3  10

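The result of the groupby operation contains only the columns for the c groups that exist in all unique ['a', 'b'] combinations, so we just grab the columns attribute.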

df.groupby(['a', 'b', 'c']).size().unstack(-1).dropna(axis=1)

#c     c1   c3
#a b          
#x w  1.0  1.0
#  z  1.0  1.0
#y w  1.0  1.0
#  z  1.0  1.0

Answer 6

You can use a list comprehension together with str.contains:

unq = [[x, len(df[(df[['a','b','c']].agg(','.join, axis=1)).str.contains(',' + x)]
                   .drop_duplicates())] for x in df['c'].unique()]
keep = [lst[0] for lst in unq if lst[1] == max([lst[1] for lst in unq])]
df = df[df['c'].isin(keep)]
df

   a  b   c   d
0  x  z  c1   1
2  x  z  c3   3
3  x  w  c1   4
4  x  w  c3   5
5  y  z  c1   6
6  y  z  c3   7
7  y  w  c1   8
9  y  w  c3  10
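One thing to keep in mind with the substring match: str.contains(',' + x) also matches longer values that merely start with x. The small sketch below uses a hypothetical value 'c10', which does not occur in the question's data, to illustrate the difference from an exact suffix match with str.endswith.

```python
import pandas as pd

# Joined "a,b,c" keys; 'c10' is a made-up value for illustration.
keys = pd.Series(['x,z,c1', 'x,z,c10'])

# Substring match: ',c1' is found inside 'x,z,c10' as well.
substring_hits = keys.str.contains(',c1', regex=False)

# Suffix match: only keys that end in exactly ',c1'.
suffix_hits = keys.str.endswith(',c1')
```

With data like the question's, where no c value is a prefix of another, both matches give the same result.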

Answer 7

If you make the assumptions listed below, this tells you which elements of column c to keep:

df.groupby("c")["a"].count() == df.groupby("c")["a"].count().max()

Output:

c
c1     True
c2    False
c3     True
Name: a, dtype: bool

Assumptions:

There are no duplicate rows.
At least one value in column c occurs in every combination of a and b.
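The boolean Series above is indexed by the values of c, not by the rows of df, so it cannot be used directly as a row mask. A minimal sketch of one way to apply it, under the same two assumptions, maps it back onto the c column; the names counts and keep_flag are chosen for this example.

```python
import pandas as pd

df = pd.DataFrame({
    'a': ['x', 'x', 'x', 'x', 'x', 'y', 'y', 'y', 'y', 'y'],
    'b': ['z', 'z', 'z', 'w', 'w', 'z', 'z', 'w', 'w', 'w'],
    'c': ['c1', 'c2', 'c3', 'c1', 'c3', 'c1', 'c3', 'c1', 'c2', 'c3'],
    'd': range(1, 11),
})

# Number of rows per c value.
counts = df.groupby('c')['a'].count()

# True for the c values that reach the maximum count.
keep_flag = counts == counts.max()

# Look up each row's c value in keep_flag to get a row-aligned mask.
out = df[df['c'].map(keep_flag)]
```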

Answer 8

You can use value_counts to get all the combinations of a and b:

vc = df[['a', 'b']].drop_duplicates().value_counts()

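Because of the drop_duplicates, every combination has a count of 1 in vc. You can then compare each group's counts against vc and filter out the groups that have missing combinations: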

df.groupby('c').filter(lambda x: x[['a', 'b']].value_counts().ge(vc).all())

Output:

   a  b   c   d
0  x  z  c1   1
2  x  z  c3   3
3  x  w  c1   4
4  x  w  c3   5
5  y  z  c1   6
6  y  z  c3   7
7  y  w  c1   8
9  y  w  c3  10

Answer 9

Assuming the example has 4 distinct combinations of a and b,

a simple solution could be:

df[df['a'].groupby(df['c']).transform('count').eq(4)]
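The hard-coded 4 only fits this particular example, and counting rows assumes that no (a, b, c) triple is repeated. A possible generalisation, sketched here with names chosen for the example, derives the number of combinations from the data and drops duplicate triples before counting.

```python
import pandas as pd

df = pd.DataFrame({
    'a': ['x', 'x', 'x', 'x', 'x', 'y', 'y', 'y', 'y', 'y'],
    'b': ['z', 'z', 'z', 'w', 'w', 'z', 'z', 'w', 'w', 'w'],
    'c': ['c1', 'c2', 'c3', 'c1', 'c3', 'c1', 'c3', 'c1', 'c2', 'c3'],
    'd': range(1, 11),
})

# Number of distinct (a, b) combinations, taken from the data.
n_combos = df.groupby(['a', 'b']).ngroups

# Count each (a, b, c) triple once, then count combinations per c value.
unique_rows = df.drop_duplicates(['a', 'b', 'c'])
combos_per_c = unique_rows.groupby('c').size()

# Keep the c values that occur in every combination.
keep = combos_per_c.index[combos_per_c.eq(n_combos)]
out = df[df['c'].isin(keep)]
```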

Answer index (9 answers: score, posted)

Answer 1: 10, accepted, 2021-01-13 18:25:01
Answer 2: 10, 2021-01-13 18:30:21
Answer 3: 9, 2021-01-13 19:09:48
Answer 4: 8, 2021-01-13 18:30:17
Answer 5: 7, 2021-01-13 18:49:25
Answer 6: 3, 2021-01-13 18:41:36
Answer 7: 1, 2021-01-13 18:21:15
Answer 8: 1, 2021-01-19 20:54:22
Answer 9: -1, 2021-01-13 19:11:31
