Pandas - Delete cells based on ranking within column
I want to delete values based on their relative rank within their column. Specifically, I want to isolate the X highest and X lowest values within several columns. So if X=2 and my dataframe looks like this:
ID Val1 Val2 Val3
001 2 8 14
002 10 15 8
003 3 1 20
004 11 11 7
005 14 4 19
The output should look like this:
ID Val1 Val2 Val3
001 2 NaN NaN
002 NaN 15 8
003 3 1 20
004 11 11 7
005 14 4 19
I know that I can make a sub-table to isolate the highest- and lowest-ranked rows using:
df = df.sort_values('Column Name')  # df.sort() was removed in later pandas versions
df2 = df.head(X) # OR: df.tail(X)
And I figure I can clear the values from the other columns in these sub-tables using:
df2['Other Column'] = np.nan
df2['Other Column B'] = np.nan
Then merge the sub-tables back together in a way that replaces NaN values when there is data in one of the tables. I tried:
df2.update(df3) # df3 is a sub-table made the same way as df2 using a different column
which only updated rows already present in df2.
I tried:
out = pd.merge(df2, df3, how='outer')
which gave me separate rows when a row appeared in both df2 and df3.
I tried:
out = df2.combine_first(df3)
which in some cases overwrote numerical values with NaN values, making it unsuitable.
There must be a way to do this: I want the original dataframe with NaN values plugged in wherever a value is not among the X highest or X lowest values in its column.
Interesting question. You can get the position of each value within the sorted values of its column (stored here in the mask DataFrame), and then keep only the values whose position falls within your defined boundaries.
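To make the idea concrete for a single column, here is a minimal sketch using the Val1 values from the question. The variable names (col, positions, keep) are illustrative only, and X=2 is assumed as in the question.

```python
import numpy as np

# Val1 column from the question's example
col = np.array([2, 10, 3, 11, 14])

# Position of each value within the sorted column (0 = smallest)
positions = np.searchsorted(np.sort(col), col)

# With X = 2, keep positions 0..1 (lowest two) and n-2..n-1 (highest two)
x = 2
keep = (positions < x) | (positions >= len(col) - x)
```

Here only the value 10 sits in the middle of the sorted order, so it is the one that would be replaced by NaN.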
In [98]:
print(df)
Val1 Val2 Val3
ID
1 2 8 14
2 10 15 8
3 3 1 20
4 11 11 7
5 14 4 19
In [99]:
mask = df.apply(lambda x: np.searchsorted(sorted(x), x))  # position of each value in its sorted column
print(mask)
Val1 Val2 Val3
ID
1 0 2 2
2 2 4 1
3 1 0 4
4 3 3 0
5 4 1 3
In [100]:
print((mask <= 1) | (mask >= (len(mask) - 2)))  # True for the 2 lowest and 2 highest per column
Val1 Val2 Val3
ID
1 True False False
2 False True True
3 True True True
4 True True True
5 True True True
In [101]:
print(df.where((mask <= 1) | (mask >= (len(mask) - 2))))  # values outside the mask become NaN
Val1 Val2 Val3
ID
1 2 NaN NaN
2 NaN 15 8
3 3 1 20
4 11 11 7
5 14 4 19
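The session above hard-codes X=2. As a possible generalization, the sketch below wraps the same "keep the X lowest and X highest per column" logic in a function. Note that it swaps np.searchsorted for DataFrame.rank, which is a different technique from the one shown above, and that the function name keep_extremes and its parameter x are hypothetical names chosen for illustration.

```python
import pandas as pd

def keep_extremes(df, x):
    """Keep the x lowest and x highest values per column; set the rest to NaN.

    Hypothetical helper, not part of pandas. Ties share a rank
    (method="min"), so more than x values may survive at either end.
    """
    low = df.rank(method="min", ascending=True)    # 1 = smallest in column
    high = df.rank(method="min", ascending=False)  # 1 = largest in column
    return df.where((low <= x) | (high <= x))

# Example data from the question
df = pd.DataFrame(
    {"Val1": [2, 10, 3, 11, 14],
     "Val2": [8, 15, 1, 11, 4],
     "Val3": [14, 8, 20, 7, 19]},
    index=pd.Index([1, 2, 3, 4, 5], name="ID"),
)

result = keep_extremes(df, 2)
print(result)
```

Because NaN is a floating-point value, any column that receives a NaN will typically be shown with float values in the result.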