Keep rows in data frame that, for all combinations of the values of certain columns, contain the same elements in another column
import pandas as pd

df = pd.DataFrame({
    'a': ['x','x','x','x','x','y','y','y','y','y'],
    'b': ['z','z','z','w','w','z','z','w','w','w'],
    'c': ['c1','c2','c3','c1','c3','c1','c3','c1','c2','c3'],
    'd': range(1,11),
})
a b c d
0 x z c1 1
1 x z c2 2
2 x z c3 3
3 x w c1 4
4 x w c3 5
5 y z c1 6
6 y z c3 7
7 y w c1 8
8 y w c2 9
9 y w c3 10
How can I keep only the rows that, for all combinations of a and b, contain the same values in c? In other words, how can I exclude rows whose c values are present in only some combinations of a and b?
For example, only c1 and c3 are present in all combinations of a and b ([x,z], [x,w], [y,z], [y,w]), so the output would be:
a b c d
0 x z c1 1
2 x z c3 3
3 x w c1 4
4 x w c3 5
5 y z c1 6
6 y z c3 7
7 y w c1 8
9 y w c3 10
Here is one way. Get the unique values of c per group, then find the elements common to all the returned arrays using reduce and np.intersect1d. Then filter the dataframe using Series.isin and boolean indexing:
from functools import reduce
import numpy as np

out = df[df['c'].isin(reduce(np.intersect1d, df.groupby(['a','b'])['c'].unique()))]
Breakdown:
s = df.groupby(['a','b'])['c'].unique()
common_elements = reduce(np.intersect1d, s)
# Returns: array(['c1', 'c3'], dtype=object)
out = df[df['c'].isin(common_elements)]  # optionally .copy()
a b c d
0 x z c1 1
2 x z c3 3
3 x w c1 4
4 x w c3 5
5 y z c1 6
6 y z c3 7
7 y w c1 8
9 y w c3 10
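The same idea can also be sketched with plain Python sets instead of reduce and np.intersect1d. This is only an illustrative variant, not part of the answer above; the names sets_per_combo and common are made up for the example.

```python
import pandas as pd

df = pd.DataFrame({
    'a': list('xxxxxyyyyy'),
    'b': list('zzzwwzzwww'),
    'c': ['c1', 'c2', 'c3', 'c1', 'c3', 'c1', 'c3', 'c1', 'c2', 'c3'],
    'd': range(1, 11),
})

# One set of c values per (a, b) combination (hypothetical helper name).
sets_per_combo = df.groupby(['a', 'b'])['c'].apply(set)

# Values of c that occur in every combination.
common = set.intersection(*sets_per_combo)

# Keep only the rows whose c value is in the common set.
out = df[df['c'].isin(common)]
print(out)
```

Sets ignore order and duplicates, so this should behave like the intersect1d version whenever the values of c are hashable.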
Let's try groupby with nunique to count the unique combinations of a and b within each group of column c:
s = df['a'] + ',' + df['b'] # combination of a, b
m = s.groupby(df['c']).transform('nunique').eq(s.nunique())
df[m]
a b c d
0 x z c1 1
2 x z c3 3
3 x w c1 4
4 x w c3 5
5 y z c1 6
6 y z c3 7
7 y w c1 8
9 y w c3 10
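Joining a and b into a string assumes both columns hold strings and that the separator never appears in the data. A possible variant, sketched here under the assumption that an integer label per combination is acceptable, uses groupby(...).ngroup() to number the combinations instead; combo_id and mask are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({
    'a': list('xxxxxyyyyy'),
    'b': list('zzzwwzzwww'),
    'c': ['c1', 'c2', 'c3', 'c1', 'c3', 'c1', 'c3', 'c1', 'c2', 'c3'],
    'd': range(1, 11),
})

# Integer label for each distinct (a, b) combination, aligned with df's index.
combo_id = df.groupby(['a', 'b']).ngroup()

# For every row: how many distinct combinations share this row's c value?
combos_per_c = combo_id.groupby(df['c']).transform('nunique')

# Keep rows whose c value appears in every combination.
mask = combos_per_c.eq(combo_id.nunique())
out = df[mask]
print(out)
```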
Try something different, crosstab:
s = pd.crosstab([df['a'],df['b']],df.c).all()
out = df.loc[df.c.isin(s.index[s])]
Out[34]:
a b c d
0 x z c1 1
2 x z c3 3
3 x w c1 4
4 x w c3 5
5 y z c1 6
6 y z c3 7
7 y w c1 8
9 y w c3 10
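To see why .all() does the job, it may help to look at the intermediate table. The sketch below splits the one-liner into steps; the variable names ct, present_everywhere and keep are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'a': list('xxxxxyyyyy'),
    'b': list('zzzwwzzwww'),
    'c': ['c1', 'c2', 'c3', 'c1', 'c3', 'c1', 'c3', 'c1', 'c2', 'c3'],
    'd': range(1, 11),
})

# Rows: (a, b) combinations; columns: values of c; cells: occurrence counts.
ct = pd.crosstab([df['a'], df['b']], df['c'])
print(ct)

# A count of 0 is falsy, so all() is True only for columns
# that have a nonzero count in every (a, b) row.
present_everywhere = ct.all()

# Values of c that occur in every combination.
keep = present_everywhere[present_everywhere].index

out = df[df['c'].isin(keep)]
print(out)
```

Here the c2 column contains zeros for the combinations (x, w) and (y, z), which is why it is dropped.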
Let's try pivoting the table, then dropping the columns that contain NA, since NA means a value is missing in that combination:
all_data = (df.pivot(index=['a','b'], columns='c', values='c')
              .loc[:, lambda x: x.notna().all()]
              .columns)
df[df['c'].isin(all_data)]
Output:
a b c d
0 x z c1 1
2 x z c3 3
3 x w c1 4
4 x w c3 5
5 y z c1 6
6 y z c3 7
7 y w c1 8
9 y w c3 10
We can use groupby + size and then unstack, which will fill NaN for groups of ['a', 'b'] that are missing a 'c' group. Then we dropna and subset the original DataFrame to the c values that survive the dropna.
df[df.c.isin(df.groupby(['a', 'b', 'c']).size().unstack(-1).dropna(axis=1).columns)]
a b c d
0 x z c1 1
2 x z c3 3
3 x w c1 4
4 x w c3 5
5 y z c1 6
6 y z c3 7
7 y w c1 8
9 y w c3 10
After the dropna, the result contains columns only for the values of c that exist in all unique combinations of ['a', 'b'], so we just grab the columns attribute.
df.groupby(['a', 'b', 'c']).size().unstack(-1).dropna(axis=1)
#c c1 c3
#a b
#x w 1.0 1.0
# z 1.0 1.0
#y w 1.0 1.0
# z 1.0 1.0
You could use a list comprehension with str.contains:
unq = [[x, len(df[(df[['a','b','c']].agg(','.join, axis=1)).str.contains(',' + x)]
.drop_duplicates())] for x in df['c'].unique()]
keep = [lst[0] for lst in unq if lst[1] == max([lst[1] for lst in unq])]
df = df[df['c'].isin(keep)]
df
a b c d
0 x z c1 1
2 x z c3 3
3 x w c1 4
4 x w c3 5
5 y z c1 6
6 y z c3 7
7 y w c1 8
9 y w c3 10
If you make the assumptions below, this tells you which elements of column c to keep:
df.groupby("c")["a"].count() == df.groupby("c")["a"].count().max()
Output:
c
c1 True
c2 False
c3 True
Name: a, dtype: bool
Assumptions:
You can use value_counts to get all combinations of a and b:
vc = df[['a', 'b']].drop_duplicates().value_counts()
Result:
a b
y z 1
w 1
x z 1
w 1
Then you can compare the counts for each group of c with vc and filter out the groups with missing combinations:
df.groupby('c').filter(lambda x: x[['a', 'b']].value_counts().ge(vc).all())
Output:
a b c d
0 x z c1 1
2 x z c3 3
3 x w c1 4
4 x w c3 5
5 y z c1 6
6 y z c3 7
7 y w c1 8
9 y w c3 10
Assuming there are 4 distinct combinations of a and b, as in the example, a simple solution can be:
df[df['a'].groupby(df['c']).transform('count').eq(4)]
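The hard-coded 4 only fits this particular data, and a plain row count would be thrown off if the same (a, b, c) triple appeared more than once. A more general sketch, assuming those are the only two concerns, derives the number of combinations from the data and counts each triple once; n_combos, combos_per_c and keep are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({
    'a': list('xxxxxyyyyy'),
    'b': list('zzzwwzzwww'),
    'c': ['c1', 'c2', 'c3', 'c1', 'c3', 'c1', 'c3', 'c1', 'c2', 'c3'],
    'd': range(1, 11),
})

# Number of distinct (a, b) combinations, derived from the data.
n_combos = len(df[['a', 'b']].drop_duplicates())

# Number of distinct (a, b) combinations in which each c value occurs;
# repeated (a, b, c) rows are counted only once.
combos_per_c = df.drop_duplicates(['a', 'b', 'c']).groupby('c').size()

# c values that occur in every combination.
keep = combos_per_c[combos_per_c.eq(n_combos)].index

out = df[df['c'].isin(keep)]
print(out)
```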