
Pandas - Delete cells based on ranking within column

I want to delete values based on their relative rank within their column. Specifically, I want to isolate the X highest and X lowest values within each of several columns. So if X=2 and my dataframe looks like this:

ID    Val1    Val2    Val3    
001   2       8       14      
002   10      15      8
003   3       1       20
004   11      11      7
005   14      4       19

The output should look like this:

ID    Val1    Val2    Val3    
001   2       NaN     NaN      
002   NaN     15      8
003   3       1       20
004   11      11      7
005   14      4       19

I know that I can make a sub-table to isolate the highest- or lowest-ranked rows using:

df = df.sort_values('Column Name')  # DataFrame.sort() was removed from pandas
df2 = df.head(X) # OR: df.tail(X)

And I figure I can clear the values from the other columns in these sub-tables using:

df2['Other Column'] = np.nan
df2['Other Column B'] = np.nan

Then merge the sub-tables back together in a way that replaces NaN values when there is data in one of the tables. I tried:

df2.update(df3) # df3 is a sub-table made the same way as df2 using a different column

which only updated rows already present in df2.

I tried:

out = pd.merge(df2, df3, how='outer')

which gave me separate rows when a row appeared in both df2 and df3.

I tried:

out = df2.combine_first(df3)

which overwrote numerical values with NaN values in some cases, making it unsuitable.

There must be a way to do this: I want the original dataframe with NaN values plugged in wherever a value is not among the X highest or X lowest values in its column.

Interesting question. You can get the position of each value within the sorted values of its own column (stored here in the mask DataFrame), and then keep only the values whose position falls within your defined boundaries.

In [98]:
print(df)
    Val1  Val2  Val3
ID                  
1      2     8    14
2     10    15     8
3      3     1    20
4     11    11     7
5     14     4    19
In [99]:
mask = df.apply(lambda x: np.searchsorted(sorted(x),x))
print(mask)
    Val1  Val2  Val3
ID                  
1      0     2     2
2      2     4     1
3      1     0     4
4      3     3     0
5      4     1     3
In [100]:
print((mask<=1)|(mask>=(len(mask)-2)))
     Val1   Val2   Val3
ID                     
1    True  False  False
2   False   True   True
3    True   True   True
4    True   True   True
5    True   True   True
In [101]:
print(df.where((mask<=1)|(mask>=(len(mask)-2))))
    Val1  Val2  Val3
ID                  
1      2   NaN   NaN
2    NaN    15     8
3      3     1    20
4     11    11     7
5     14     4    19
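
For anyone who wants to run this outside an interactive session, here is a minimal self-contained sketch of the same searchsorted idea, written for Python 3. The helper name keep_extremes and its parameter x are made up for illustration and are not part of pandas. It assumes numeric columns with no NaN values and no ties.

```python
import numpy as np
import pandas as pd

# Sample data from the question, indexed by ID.
df = pd.DataFrame(
    {
        "Val1": [2, 10, 3, 11, 14],
        "Val2": [8, 15, 1, 11, 4],
        "Val3": [14, 8, 20, 7, 19],
    },
    index=pd.Index([1, 2, 3, 4, 5], name="ID"),
)


def sorted_position(col):
    """Position of each value within its own sorted column (0 = smallest)."""
    values = col.to_numpy()
    return pd.Series(np.searchsorted(np.sort(values), values), index=col.index)


def keep_extremes(frame, x):
    """Hypothetical helper: keep the x lowest and x highest values per column.

    Everything else becomes NaN. Assumes numeric columns, no NaN, no ties.
    """
    pos = frame.apply(sorted_position)
    n = len(frame)
    # Keep positions 0..x-1 (lowest) and n-x..n-1 (highest).
    return frame.where((pos < x) | (pos >= n - x))


result = keep_extremes(df, 2)
print(result)
```

With the question's data and x=2, this blanks out 10 in Val1, 8 in Val2 and 14 in Val3, and leaves every other cell in place. If a column contains duplicate values, searchsorted gives all of them the same position, so the number of values kept may differ from x. In that case, ranking with df.rank(method='first') may be a better fit.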
