Python: find records in a DataFrame whose column values are greater than or equal to the median of their subgroup
Suppose I have a DataFrame that can be created by:
import pandas as pd

df = pd.DataFrame({'group1': ['1','2','3','4','5','6'],
                   'group2': ['c','c','d','d','d','e'],
                   'value1': [1.1,2,3,4,5,6],
                   'value2': [7.1,8,9,10,11,12]
                   })
df = df.set_index(['group1', 'group2'])
I want to subset df by the value2 column, keeping only the rows whose value is greater than or equal to the median of their subgroup, where the subgroups are defined by the group2 index level. In this example, the rows with group1 in ['2','4','5','6'] should stay in the result. Can anyone help?
This should work:
import numpy as np

# Replace values below their group median with NaN, then drop those rows
df['value2'] = df['value2'].groupby(level='group2').transform(lambda x: np.where(x>=np.median(x), x, np.nan))
df = df.dropna()
What this does is take the value2 column and split it into groups by group2. For each group, it finds the median, then replaces any value below the median with NaN. It then writes the result back into the value2 column and gets rid of all the rows with NaN values.
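The steps above can be sketched more explicitly as follows. This is only an illustration, not the answer's exact code: it swaps the lambda with np.where for transform('median') plus Series.where, and the names medians, masked, result and kept are made up for the example. It also restricts dropna to value2, on the assumption that rows with missing values in other columns should be kept.

```python
import pandas as pd

df = pd.DataFrame({'group1': ['1', '2', '3', '4', '5', '6'],
                   'group2': ['c', 'c', 'd', 'd', 'd', 'e'],
                   'value1': [1.1, 2, 3, 4, 5, 6],
                   'value2': [7.1, 8, 9, 10, 11, 12]})
df = df.set_index(['group1', 'group2'])

# Step 1: per-group median of value2, broadcast back to every row
medians = df['value2'].groupby(level='group2').transform('median')

# Step 2: keep values at or above their group median, NaN elsewhere
masked = df['value2'].where(df['value2'] >= medians)

# Step 3: put the masked column into a copy and drop the rows that became NaN
result = df.assign(value2=masked).dropna(subset=['value2'])

# group1 labels of the surviving rows (illustrative helper variable)
kept = result.index.get_level_values('group1').tolist()
```

Because assign returns a new DataFrame, the original df is left untouched in this variant.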
As an alternative, here is a slightly less clear one-liner:
df = df.groupby(level='group2').transform(lambda x: x if x.name != 'value2' else np.where(x>=np.median(x), x, np.nan)).dropna()
This does roughly the same thing, except that it runs on both columns and leaves the value1 column unchanged.
Note that with the second approach you can store the result in a second variable, such as df2, without altering the original df if you prefer. You could do that with the first approach too, but it would require yet another line to make a copy. This version is much simpler for that case.
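If leaving the original df untouched is the main goal, a boolean mask is another option. The sketch below is not from the answer above; it assumes the same sample data, and group_median is an illustrative name.

```python
import pandas as pd

df = pd.DataFrame({'group1': ['1', '2', '3', '4', '5', '6'],
                   'group2': ['c', 'c', 'd', 'd', 'd', 'e'],
                   'value1': [1.1, 2, 3, 4, 5, 6],
                   'value2': [7.1, 8, 9, 10, 11, 12]})
df = df.set_index(['group1', 'group2'])

# Median of value2 within each group2 level, aligned to df's index
group_median = df.groupby(level='group2')['value2'].transform('median')

# Select the qualifying rows into a new frame; df itself is not modified
df2 = df[df['value2'] >= group_median]
```

Since no values are overwritten with NaN, no dropna call is needed here.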
I think you need to do a groupby and comparison before setting the index:
df = pd.DataFrame({'group1': ['1','2','3','4','5','6'],
                   'group2': ['c','c','d','d','d','e'],
                   'value1': [1.1,2,3,4,5,6],
                   'value2': [7.1,8,9,10,11,12]
                   })
# Median of value2 for each group2 value
gb = df.groupby('group2').value2.median()
# Attach each row's group median as a new column, value2_median
joined = df.join(gb, on='group2', rsuffix='_median')
df_filtered = df[df.value2 >= joined.value2_median]
df_filtered = df_filtered.set_index(['group1', 'group2'])
>>> df_filtered
               value1  value2
group1 group2
2      c            2       8
4      d            4      10
5      d            5      11
6      e            6      12