Python: find the records in a DataFrame whose column value is greater than or equal to the median of each sub-group

Question

Suppose I have a DataFrame that can be created with:

import pandas as pd

df = pd.DataFrame({'group1': ['1','2','3','4','5','6'],
                   'group2': ['c','c','d','d','d','e'],
                   'value1': [1.1,2,3,4,5,6],
                   'value2': [7.1,8,9,10,11,12]
                   })
df = df.set_index(['group1', 'group2'])

I want to subset df to the rows whose value2 is greater than or equal to the median of value2 within each sub-group defined by the group2 index level. In this example, the rows with group1 in ['2','4','5','6'] should stay in the result. Can anyone help?

Answer 1

This should work:

import numpy as np

df['value2'] = df['value2'].groupby(level='group2').transform(lambda x: np.where(x>=np.median(x), x, np.nan))
df = df.dropna()

This takes the value2 column and splits it into groups by group2. For each group, it finds the median, then replaces any value below the median with NaN. It then writes the result back into the value2 column and drops every row that contains a NaN value.
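To make those steps easier to follow, here is a minimal sketch that runs them one at a time on the question's data. The names grouped, medians, masked and result are illustrative only, and the sketch writes into a copy through assign rather than overwriting df, which is a small departure from the snippet above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'group1': ['1', '2', '3', '4', '5', '6'],
                   'group2': ['c', 'c', 'd', 'd', 'd', 'e'],
                   'value1': [1.1, 2, 3, 4, 5, 6],
                   'value2': [7.1, 8, 9, 10, 11, 12]})
df = df.set_index(['group1', 'group2'])

# Step 1: split value2 by the group2 index level and look at each median.
grouped = df['value2'].groupby(level='group2')
medians = grouped.median()

# Step 2: within each group, keep values at or above the median, NaN otherwise.
masked = grouped.transform(lambda x: np.where(x >= np.median(x), x, np.nan))

# Step 3: put the masked column into a copy of df and drop the NaN rows.
result = df.assign(value2=masked).dropna()

print(medians)
print(result)
```

The rows that survive are those with group1 in ['2', '4', '5', '6'], which matches what the question asks for.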

As an alternative, here is a slightly less readable one-liner:

df = df.groupby(level='group2').transform(lambda x: x if x.name != 'value2' else np.where(x>=np.median(x), x, np.nan)).dropna()

This does roughly the same thing, except that it runs on both columns and leaves the value1 column unchanged.

Note that with the second approach you can assign the result to a second variable, such as df2, and leave the original df unaltered if you prefer. The first approach can do that too, but it needs one more line to make a copy first, so the one-liner is simpler for that case.
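If the goal is to leave df untouched, another option is a boolean mask built from the group medians. This is not the method used in the answer above; it is a sketch that assumes transform('median') is acceptable for your data, and group_median and df2 are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({'group1': ['1', '2', '3', '4', '5', '6'],
                   'group2': ['c', 'c', 'd', 'd', 'd', 'e'],
                   'value1': [1.1, 2, 3, 4, 5, 6],
                   'value2': [7.1, 8, 9, 10, 11, 12]})
df = df.set_index(['group1', 'group2'])

# Broadcast each group's median back onto its rows; the result shares df's index.
group_median = df.groupby(level='group2')['value2'].transform('median')

# Select rows with a boolean mask; df itself is not modified.
df2 = df[df['value2'] >= group_median]

print(df2)
```

Because no values are overwritten with NaN, this variant also keeps rows that already contain NaN in other columns, which dropna() would remove.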

Answer 2

I think you need to do the groupby and the comparison before setting the index:

df = pd.DataFrame({'group1': ['1','2','3','4','5','6'],
                   'group2': ['c','c','d','d','d','e'],
                   'value1': [1.1,2,3,4,5,6],
                   'value2': [7.1,8,9,10,11,12]
                   })
# Median of value2 for each group2 value
gb = df.groupby('group2').value2.median()
# Attach each row's group median as a value2_median column
joined = df.join(gb, on='group2', rsuffix='_median')
# Keep rows at or above their group's median, then set the index
df_filtered = df[df.value2 >= joined.value2_median].set_index(['group1', 'group2'])
>>> df_filtered 
               value1  value2
group1 group2                
2      c            2       8
4      d            4      10
5      d            5      11
6      e            6      12
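To see what the join contributes, the sketch below prints the intermediate frame before filtering. The names joined and keep are illustrative; the data and the join call are the same as in the answer.

```python
import pandas as pd

df = pd.DataFrame({'group1': ['1', '2', '3', '4', '5', '6'],
                   'group2': ['c', 'c', 'd', 'd', 'd', 'e'],
                   'value1': [1.1, 2, 3, 4, 5, 6],
                   'value2': [7.1, 8, 9, 10, 11, 12]})

# Per-group medians, indexed by group2 (the Series is named value2).
gb = df.groupby('group2').value2.median()

# Joining on group2 lines each row up with its group's median.
# The name clash on value2 is resolved by rsuffix, giving value2_median.
joined = df.join(gb, on='group2', rsuffix='_median')
print(joined[['group2', 'value2', 'value2_median']])

# The comparison is then an ordinary row-wise test.
keep = joined.value2 >= joined.value2_median
print(keep.tolist())
```

The join only serves to align each row with its group's median, so the comparison can be done column against column.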

2 solutions

Solution 1
1 2015-03-30 15:34:15

Solution 2
0 2015-03-30 15:00:49
