How to find the top column values of each row in a pandas dataframe
For a given dataframe with m columns (let's assume m = 10), I am trying to find the top n column values within each row (let's assume n = 2). After finding these top n values for each row, I would like to set the remaining m - n column values in that row to 0.
For example, starting with the dataframe of values shown in the first table below, I am trying to create the second table by applying the filtering described above. If more than n columns share the same value, the column with the lower index is given preference.
| col_A | col_B | col_C | col_D | col_E |
|-------|-------|-------|-------|-------|
| 0.1 | 0.1 | 0.3 | 0.4 | 0.5 |
| 0.06 | 0.1 | 0.1 | 0.1 | 0.01 |
| 0.24 | 0.24 | 0.24 | 0.24 | 0.24 |
| 0.20 | 0.25 | 0.30 | 0.12 | 0.02 |

With n = 2, the desired output is:

| col_A | col_B | col_C | col_D | col_E |
|-------|-------|-------|-------|-------|
| 0 | 0 | 0 | 0.4 | 0.5 |
| 0 | 0.1 | 0.1 | 0 | 0 |
| 0.24 | 0.24 | 0 | 0 | 0 |
| 0 | 0.25 | 0.3 | 0 | 0 |
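To make the example reproducible, the two tables can be built as dataframes roughly as follows. This is only a sketch: the variable names `df` and `expected` are chosen here for illustration, and the values are copied from the tables above.

```python
import pandas as pd

# Input values, copied from the first table.
df = pd.DataFrame(
    {
        "col_A": [0.10, 0.06, 0.24, 0.20],
        "col_B": [0.10, 0.10, 0.24, 0.25],
        "col_C": [0.30, 0.10, 0.24, 0.30],
        "col_D": [0.40, 0.10, 0.24, 0.12],
        "col_E": [0.50, 0.01, 0.24, 0.02],
    }
)

# Desired result for n = 2, copied from the second table.
expected = pd.DataFrame(
    {
        "col_A": [0.0, 0.0, 0.24, 0.0],
        "col_B": [0.0, 0.10, 0.24, 0.25],
        "col_C": [0.0, 0.10, 0.0, 0.30],
        "col_D": [0.40, 0.0, 0.0, 0.0],
        "col_E": [0.50, 0.0, 0.0, 0.0],
    }
)

print(df)
print(expected)
```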
Is there an easier way to implement this? A vectorized approach could dramatically reduce the processing time on large dataframes.
Thanks
The first idea is to find the top N values per row with Series.nlargest and then set the remaining values to 0 with DataFrame.where:
N = 2
# Keep a value only where it is one of the row's N largest; set everything else to 0
df = df.where(df.apply(lambda x: x.eq(x.nlargest(N)), axis=1), 0)
print(df)
col_A col_B col_C col_D col_E
0 0.00 0.00 0.0 0.4 0.5
1 0.00 0.10 0.1 0.0 0.0
2 0.24 0.24 0.0 0.0 0.0
3 0.00 0.25 0.3 0.0 0.0
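For anyone who wants to run this end to end, here is a minimal self-contained sketch of the same idea. The helper name `keep_top_n_per_row` is hypothetical (it is not part of pandas), and the sample frame is rebuilt from the question's first table. The tie rule from the question falls out of `nlargest`, whose default `keep="first"` prefers earlier positions, while `eq` aligns on column labels so that columns missing from the `nlargest` result compare as False.

```python
import pandas as pd


def keep_top_n_per_row(frame: pd.DataFrame, n: int) -> pd.DataFrame:
    """Hypothetical helper: keep the n largest values of each row, zero the rest."""
    # Row by row, mark the positions that belong to that row's n largest values.
    # nlargest(n) keeps the first occurrence on ties, so lower column positions win.
    mask = frame.apply(lambda row: row.eq(row.nlargest(n)), axis=1)
    # where() keeps values where the mask is True and writes 0 elsewhere.
    return frame.where(mask, 0)


# Sample input copied from the question's first table.
df = pd.DataFrame(
    {
        "col_A": [0.10, 0.06, 0.24, 0.20],
        "col_B": [0.10, 0.10, 0.24, 0.25],
        "col_C": [0.30, 0.10, 0.24, 0.30],
        "col_D": [0.40, 0.10, 0.24, 0.12],
        "col_E": [0.50, 0.01, 0.24, 0.02],
    }
)

result = keep_top_n_per_row(df, 2)
print(result)
```

Because `apply(..., axis=1)` calls the lambda once per row in Python, this version is easy to read but tends to slow down as the number of rows grows.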
To improve performance, use numpy (solution from @Divakar):
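import numpy as np

N = 2
# https://stackoverflow.com/a/61518029/2901002
# Stable sort on the negated values: descending order, ties keep the lower column index
idx = np.argsort(-df.to_numpy(), kind='mergesort')[:, :N]
# Boolean mask that is True only at the top-N positions of each row
mask = np.zeros(df.shape, dtype=bool)
np.put_along_axis(mask, idx, True, axis=-1)
df = df.where(mask, 0)
print(df)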
col_A col_B col_C col_D col_E
0 0.00 0.00 0.0 0.4 0.5
1 0.00 0.10 0.1 0.0 0.0
2 0.24 0.24 0.0 0.0 0.0
3 0.00 0.25 0.3 0.0 0.0
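The numpy approach can also be wrapped in a function and checked against the `nlargest` version on the sample data. The sketch below assumes an all-numeric frame without NaN values. The helper name `keep_top_n_per_row_np` is hypothetical. The stable sort (`kind="mergesort"`) is what preserves the "lower column index wins" rule, since equal values keep their original left-to-right order after sorting.

```python
import numpy as np
import pandas as pd


def keep_top_n_per_row_np(frame: pd.DataFrame, n: int) -> pd.DataFrame:
    """Hypothetical helper: numpy-based variant, assuming numeric columns only."""
    values = frame.to_numpy()
    # Sorting the negated values ascending gives descending order of the originals.
    # A stable sort leaves tied values in their original column order.
    idx = np.argsort(-values, axis=1, kind="mergesort")[:, :n]
    # Start from an all-False mask and switch on the top-n positions of each row.
    mask = np.zeros(values.shape, dtype=bool)
    np.put_along_axis(mask, idx, True, axis=1)
    return frame.where(mask, 0)


# Sample input copied from the question's first table.
df = pd.DataFrame(
    {
        "col_A": [0.10, 0.06, 0.24, 0.20],
        "col_B": [0.10, 0.10, 0.24, 0.25],
        "col_C": [0.30, 0.10, 0.24, 0.30],
        "col_D": [0.40, 0.10, 0.24, 0.12],
        "col_E": [0.50, 0.01, 0.24, 0.02],
    }
)

fast = keep_top_n_per_row_np(df, 2)
# Row-wise nlargest version, used here only as a cross-check.
slow = df.where(df.apply(lambda row: row.eq(row.nlargest(2)), axis=1), 0)

print(fast)
print((fast.to_numpy() == slow.to_numpy()).all())
```

The whole computation runs on one 2-D array rather than row by row, which is why this variant is usually the better fit for large dataframes.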