Python: find records in a DataFrame whose column values are greater than or equal to the median of their subgroup

Suppose I have a DataFrame that can be created with:

import pandas as pd

df = pd.DataFrame({'group1': ['1','2','3','4','5','6'],
                   'group2': ['c','c','d','d','d','e'],
                   'value1': [1.1,2,3,4,5,6],
                   'value2': [7.1,8,9,10,11,12]
                   })
df = df.set_index(['group1', 'group2'])

I want to subset df on the value2 column, keeping only the rows whose value2 is greater than or equal to the median of their subgroup, where the subgroups are defined by the group2 index level. In this example, the rows with group1 in ['2','4','5','6'] should stay in the result. Can anyone help?

This should work:

import numpy as np
# Replace values below their group2 median with NaN, then drop those rows
df['value2'] = df['value2'].groupby(level='group2').transform(lambda x: np.where(x >= np.median(x), x, np.nan))
df = df.dropna()

This takes the value2 column and splits it into groups by group2. For each group, it finds the median and replaces any value below the median with NaN. It then writes the result back into the value2 column and drops all rows that contain NaN values.
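The steps described above can be traced on the example data. The following is a minimal sketch, not the answer's exact code: it swaps np.where for pandas' Series.where and uses transform('median'), and the names group_median, masked and result are introduced here only for illustration.

```python
import pandas as pd

df = pd.DataFrame({'group1': ['1', '2', '3', '4', '5', '6'],
                   'group2': ['c', 'c', 'd', 'd', 'd', 'e'],
                   'value1': [1.1, 2, 3, 4, 5, 6],
                   'value2': [7.1, 8, 9, 10, 11, 12]})
df = df.set_index(['group1', 'group2'])

# Step 1: median of value2 per group2 group, broadcast back to every row
group_median = df['value2'].groupby(level='group2').transform('median')

# Step 2: values below their group median become NaN, the rest are kept
masked = df['value2'].where(df['value2'] >= group_median)

# Step 3: put the masked column back and drop the rows marked with NaN
result = df.assign(value2=masked).dropna()

print(masked)
print(result)
```

Printing masked before the dropna step makes it easy to see which rows are about to be removed.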

As an alternative, here is a slightly less clear one-liner:

df = df.groupby(level='group2').transform(lambda x: x if x.name != 'value2' else np.where(x >= np.median(x), x, np.nan)).dropna()

This does roughly the same thing, except that it runs on both value columns and leaves the value1 column unchanged.

Note that with the second approach you can assign the result to a second variable, such as df2, and leave the original df unaltered if you prefer. You could do the same with the first approach, but that would take an extra line to make a copy first, so the one-liner is simpler in that case.
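If leaving the original df untouched is the goal, a boolean mask is another option. This sketch is an alternative to the two snippets above rather than a restatement of them; it assumes the same example data, and the names at_or_above_median and df2 are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'group1': ['1', '2', '3', '4', '5', '6'],
                   'group2': ['c', 'c', 'd', 'd', 'd', 'e'],
                   'value1': [1.1, 2, 3, 4, 5, 6],
                   'value2': [7.1, 8, 9, 10, 11, 12]})
df = df.set_index(['group1', 'group2'])

# True where a row's value2 is at or above the median of its group2 group
at_or_above_median = df['value2'] >= df.groupby(level='group2')['value2'].transform('median')

# Boolean indexing returns a new DataFrame; df itself is not modified
df2 = df[at_or_above_median]

print(df2)
```

Because nothing is written back into df, no NaN placeholders or dropna call are needed.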

I think you need to do the groupby and the comparison before setting the index:

import pandas as pd

df = pd.DataFrame({'group1': ['1','2','3','4','5','6'],
                   'group2': ['c','c','d','d','d','e'],
                   'value1': [1.1,2,3,4,5,6],
                   'value2': [7.1,8,9,10,11,12]
                   })
# Median of value2 within each group2 group
gb = df.groupby('group2').value2.median()
# Attach each row's group median as value2_median
with_median = df.join(gb, on='group2', rsuffix='_median')
# Keep rows at or above their group median, then set the index
df_filtered = df[df.value2 >= with_median.value2_median]
df_filtered = df_filtered.set_index(['group1', 'group2'])
>>> df_filtered 
               value1  value2
group1 group2                
2      c            2       8
4      d            4      10
5      d            5      11
6      e            6      12
