Keep rows in data frame that, for all combinations of the values of certain columns, contain the same elements in another column

import pandas as pd

df = pd.DataFrame({'a': ['x','x','x','x','x','y','y','y','y','y'],
                   'b': ['z','z','z','w','w','z','z','w','w','w'],
                   'c': ['c1','c2','c3','c1','c3','c1','c3','c1','c2','c3'],
                   'd': range(1, 11)})

   a  b   c   d
0  x  z  c1   1
1  x  z  c2   2
2  x  z  c3   3
3  x  w  c1   4
4  x  w  c3   5
5  y  z  c1   6
6  y  z  c3   7
7  y  w  c1   8
8  y  w  c2   9
9  y  w  c3  10

How can I keep only the rows that, for all combinations of a and b, contain the same values in c? In other words, how do I exclude rows whose c value is present in only some combinations of a and b?

For example, only c1 and c3 are present in all combinations of a and b ([x,z], [x,w], [y,z], [y,w]), so the output would be:

   a  b   c   d
0  x  z  c1   1
2  x  z  c3   3
3  x  w  c1   4
4  x  w  c3   5
5  y  z  c1   6
6  y  z  c3   7
7  y  w  c1   8
9  y  w  c3  10

Here is one way. Get the unique values of c per group, then find the elements common to all of the returned arrays using reduce and np.intersect1d. Finally, filter the dataframe using Series.isin and boolean indexing:

from functools import reduce
import numpy as np

out = df[df['c'].isin(reduce(np.intersect1d, df.groupby(['a','b'])['c'].unique()))]

Breakdown:

s = df.groupby(['a','b'])['c'].unique()
common_elements = reduce(np.intersect1d,s)
#Returns :-> array(['c1', 'c3'], dtype=object)

out = df[df['c'].isin(common_elements)]  # optionally chain .copy()

   a  b   c   d
0  x  z  c1   1
2  x  z  c3   3
3  x  w  c1   4
4  x  w  c3   5
5  y  z  c1   6
6  y  z  c3   7
7  y  w  c1   8
9  y  w  c3  10
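
The same idea can also be sketched without NumPy, using plain Python sets. This is only an illustrative variant of the answer above, not a replacement for it; the names sets_per_group and common are made up for this example.

```python
import pandas as pd

df = pd.DataFrame({
    'a': ['x', 'x', 'x', 'x', 'x', 'y', 'y', 'y', 'y', 'y'],
    'b': ['z', 'z', 'z', 'w', 'w', 'z', 'z', 'w', 'w', 'w'],
    'c': ['c1', 'c2', 'c3', 'c1', 'c3', 'c1', 'c3', 'c1', 'c2', 'c3'],
    'd': range(1, 11),
})

# One set of c values per (a, b) combination.
sets_per_group = df.groupby(['a', 'b'])['c'].apply(set)

# Values of c that appear in every combination.
common = set.intersection(*sets_per_group)

# Keep only the rows whose c value is shared by all combinations.
out = df[df['c'].isin(common)]
print(out)
```

Unlike np.intersect1d, a set carries no ordering, which does not matter here because the result is only used for the isin membership test.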

Let's try groupby with nunique to count the unique combinations of a and b per value of c:

s = df['a'] + ',' + df['b'] # combination of a, b
m = s.groupby(df['c']).transform('nunique').eq(s.nunique())

df[m]

   a  b   c   d
0  x  z  c1   1
2  x  z  c3   3
3  x  w  c1   4
4  x  w  c3   5
5  y  z  c1   6
6  y  z  c3   7
7  y  w  c1   8
9  y  w  c3  10
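
Building the key by string concatenation assumes that a and b are strings and that the separator cannot make two different pairs look alike. A possible variant, sketched here under the assumption that a and b contain no missing values, labels each combination with groupby(...).ngroup() instead; the names key and mask are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'a': ['x', 'x', 'x', 'x', 'x', 'y', 'y', 'y', 'y', 'y'],
    'b': ['z', 'z', 'z', 'w', 'w', 'z', 'z', 'w', 'w', 'w'],
    'c': ['c1', 'c2', 'c3', 'c1', 'c3', 'c1', 'c3', 'c1', 'c2', 'c3'],
    'd': range(1, 11),
})

# Integer label for each (a, b) combination, aligned with df's index.
key = df.groupby(['a', 'b']).ngroup()

# For each row, count how many distinct combinations its c value occurs in,
# and compare with the total number of combinations.
mask = key.groupby(df['c']).transform('nunique').eq(key.nunique())

out = df[mask]
print(out)
```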

Try something different, crosstab:

s = pd.crosstab([df['a'],df['b']],df.c).all()
out = df.loc[df.c.isin(s.index[s])]
Output:
   a  b   c   d
0  x  z  c1   1
2  x  z  c3   3
3  x  w  c1   4
4  x  w  c3   5
5  y  z  c1   6
6  y  z  c3   7
7  y  w  c1   8
9  y  w  c3  10
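
To see why .all() works here, it can help to look at the intermediate table. The sketch below spells out the steps; the names ct, present and keep are illustrative and not part of the answer above.

```python
import pandas as pd

df = pd.DataFrame({
    'a': ['x', 'x', 'x', 'x', 'x', 'y', 'y', 'y', 'y', 'y'],
    'b': ['z', 'z', 'z', 'w', 'w', 'z', 'z', 'w', 'w', 'w'],
    'c': ['c1', 'c2', 'c3', 'c1', 'c3', 'c1', 'c3', 'c1', 'c2', 'c3'],
    'd': range(1, 11),
})

# Rows are (a, b) combinations, columns are c values, cells are row counts.
ct = pd.crosstab([df['a'], df['b']], df['c'])
print(ct)

# A count of 0 means the c value is missing from that combination.
# .all() treats 0 as False, so this is equivalent to ct.all().
present = ct.gt(0).all()

keep = present.index[present]
out = df[df['c'].isin(keep)]
print(out)
```

The column for c2 contains zeros for the (x, w) and (y, z) combinations, so it is the only one dropped.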

Let's try pivoting the table and then dropping the columns that contain NA, since NA means a value is missing from a combination:

all_data =(df.pivot(index=['a','b'], columns='c', values='c')
             .loc[:, lambda x: x.notna().all()]
             .columns)
df[df['c'].isin(all_data)]

Output:

   a  b   c   d
0  x  z  c1   1
2  x  z  c3   3
3  x  w  c1   4
4  x  w  c3   5
5  y  z  c1   6
6  y  z  c3   7
7  y  w  c1   8
9  y  w  c3  10

We can use groupby + size and then unstack, which fills in NaN for groups of ['a', 'b'] that are missing a 'c' group. Then we dropna and subset the original DataFrame to the c values that survive the dropna.

df[df.c.isin(df.groupby(['a', 'b', 'c']).size().unstack(-1).dropna(axis=1).columns)]

   a  b   c   d
0  x  z  c1   1
2  x  z  c3   3
3  x  w  c1   4
4  x  w  c3   5
5  y  z  c1   6
6  y  z  c3   7
7  y  w  c1   8
9  y  w  c3  10

After the unstack and dropna, the result of the groupby operation contains columns only for the groups of c that exist in all unique combinations of ['a', 'b'], so we just grab the columns attribute.

df.groupby(['a', 'b', 'c']).size().unstack(-1).dropna(axis=1)

#c     c1   c3
#a b          
#x w  1.0  1.0
#  z  1.0  1.0
#y w  1.0  1.0
#  z  1.0  1.0
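
The counts above become floats because unstack has to insert NaN. A possible variant, shown only as a sketch, passes fill_value=0 to unstack so the counts stay integers and then tests for zeros instead of dropping NaN; the names table and present are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'a': ['x', 'x', 'x', 'x', 'x', 'y', 'y', 'y', 'y', 'y'],
    'b': ['z', 'z', 'z', 'w', 'w', 'z', 'z', 'w', 'w', 'w'],
    'c': ['c1', 'c2', 'c3', 'c1', 'c3', 'c1', 'c3', 'c1', 'c2', 'c3'],
    'd': range(1, 11),
})

# Missing (a, b, c) groups are filled with 0 instead of NaN.
table = df.groupby(['a', 'b', 'c']).size().unstack('c', fill_value=0)

# A c value is kept only if its count is positive in every (a, b) row.
present = table.gt(0).all()

out = df[df['c'].isin(table.columns[present])]
print(out)
```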

You could use a list comprehension with str.contains:

unq = [[x, len(df[(df[['a','b','c']].agg(','.join, axis=1)).str.contains(',' + x)]
                   .drop_duplicates())] for x in df['c'].unique()]
keep = [lst[0] for lst in unq if lst[1] == max([lst[1] for lst in unq])]
df = df[df['c'].isin(keep)]
df

   a  b   c   d
0  x  z  c1   1
2  x  z  c3   3
3  x  w  c1   4
4  x  w  c3   5
5  y  z  c1   6
6  y  z  c3   7
7  y  w  c1   8
9  y  w  c3  10

If you make the assumptions below, this tells you which elements of column c to keep:

df.groupby("c")["a"].count() == df.groupby("c")["a"].count().max()

Output:

c
c1     True
c2    False
c3     True
Name: a, dtype: bool

Assumptions:

  1. There are no duplicates.
  2. There is at least one value of column c that occurs in all combinations of a and b.
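
The first assumption matters because count only tallies rows. The sketch below builds a hypothetical variant of the data (dup, in which the c2 rows are repeated) to show how duplicates mislead the plain count, and how dropping duplicate (a, b, c) rows first restores the expected answer. The names dup, naive and fixed are made up for this illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'a': ['x', 'x', 'x', 'x', 'x', 'y', 'y', 'y', 'y', 'y'],
    'b': ['z', 'z', 'z', 'w', 'w', 'z', 'z', 'w', 'w', 'w'],
    'c': ['c1', 'c2', 'c3', 'c1', 'c3', 'c1', 'c3', 'c1', 'c2', 'c3'],
    'd': range(1, 11),
})

# Hypothetical data: repeat the two c2 rows, so c2 now has 4 rows
# although it still occurs in only 2 of the 4 (a, b) combinations.
dup = pd.concat([df, df[df['c'].eq('c2')]], ignore_index=True)

# Plain row count: c2 reaches the maximum and is wrongly kept.
counts = dup.groupby('c')['a'].count()
naive = counts == counts.max()

# Count each (a, b, c) triple only once before comparing.
dedup_counts = dup.drop_duplicates(['a', 'b', 'c']).groupby('c')['a'].count()
fixed = dedup_counts == dedup_counts.max()

print(naive)
print(fixed)
```

Dropping duplicates addresses only the first assumption; the second one (some c value covering every combination) is still required for the comparison against the maximum to be meaningful.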

You can use value_counts to get all combinations of a and b:

vc = df[['a', 'b']].drop_duplicates().value_counts()

Result:

a  b
y  z    1
   w    1
x  z    1
   w    1

Then you can compare the counts for each group with vc and filter out the groups with missing combinations:

df.groupby('c').filter(lambda x: x[['a', 'b']].value_counts().ge(vc).all())

Output:

   a  b   c   d
0  x  z  c1   1
2  x  z  c3   3
3  x  w  c1   4
4  x  w  c3   5
5  y  z  c1   6
6  y  z  c3   7
7  y  w  c1   8
9  y  w  c3  10

Assuming there are 4 distinct combinations of a and b, as in the example, a simple solution can be:

df[df['a'].groupby(df['c']).transform('count').eq(4)]
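
The hard-coded 4 can be derived from the data instead. The following is a minimal sketch, assuming there are no duplicate (a, b, c) rows (otherwise the row count per c no longer equals the number of combinations it covers); n_combos and mask are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({
    'a': ['x', 'x', 'x', 'x', 'x', 'y', 'y', 'y', 'y', 'y'],
    'b': ['z', 'z', 'z', 'w', 'w', 'z', 'z', 'w', 'w', 'w'],
    'c': ['c1', 'c2', 'c3', 'c1', 'c3', 'c1', 'c3', 'c1', 'c2', 'c3'],
    'd': range(1, 11),
})

# Number of distinct (a, b) combinations present in the data.
n_combos = df.groupby(['a', 'b']).ngroups

# Keep rows whose c value has exactly one row per combination.
mask = df['a'].groupby(df['c']).transform('count').eq(n_combos)

out = df[mask]
print(out)
```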
