How to find the top column values in each row of a pandas DataFrame

Question

For a given DataFrame with m columns (let's assume m = 10), I am trying to find the top n column values within each row (let's assume n = 2). After finding these top n values for each row, I would like to set the remaining m - n column values in that row to 0.

For example, starting with the DataFrame of values shown in the first table, I am trying to produce the second table by applying the filtering described above. If more than n columns share the same value, the columns with the lower index are given preference.

| col_A | col_B | col_C | col_D | col_E |
|-------|-------|-------|-------|-------|
| 0.1   | 0.1   | 0.3   | 0.4   | 0.5   |
| 0.06  | 0.1   | 0.1   | 0.1   | 0.01  |
| 0.24  | 0.24  | 0.24  | 0.24  | 0.24  |
| 0.20  | 0.25  | 0.30  | 0.12  | 0.02  |

| col_A | col_B | col_C | col_D | col_E |
|-------|-------|-------|-------|-------|
| 0     | 0     | 0     | 0.4   | 0.5   |
| 0     | 0.1   | 0.1   | 0     | 0     |
| 0.24  | 0.24  | 0     | 0     | 0     |
| 0     | 0.25  | 0.3   | 0     | 0     |
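
For reproducibility, the two tables above could be built as follows. This is only a minimal sketch, assuming pandas is imported as `pd`; the names `df` and `expected` are illustrative and not part of the original data.

```python
import pandas as pd

cols = ["col_A", "col_B", "col_C", "col_D", "col_E"]

# Input values (first table)
df = pd.DataFrame(
    [
        [0.10, 0.10, 0.30, 0.40, 0.50],
        [0.06, 0.10, 0.10, 0.10, 0.01],
        [0.24, 0.24, 0.24, 0.24, 0.24],
        [0.20, 0.25, 0.30, 0.12, 0.02],
    ],
    columns=cols,
)

# Desired result for n = 2 (second table)
expected = pd.DataFrame(
    [
        [0.00, 0.00, 0.00, 0.40, 0.50],
        [0.00, 0.10, 0.10, 0.00, 0.00],
        [0.24, 0.24, 0.00, 0.00, 0.00],
        [0.00, 0.25, 0.30, 0.00, 0.00],
    ],
    columns=cols,
)
```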

Is there an easier way to implement this? A vectorized approach would help dramatically reduce the processing time on large DataFrames.

Thanks

Answer 1

The first idea is to find the top N values per row with Series.nlargest, compare each row against them, and set the remaining values to 0 with DataFrame.where:

N = 2
# Keep the N largest values in each row (ties go to the leftmost columns), set the rest to 0
df = df.where(df.apply(lambda x: x.eq(x.nlargest(N)), axis=1), 0)
print(df)
   col_A  col_B  col_C  col_D  col_E
0   0.00   0.00    0.0    0.4    0.5
1   0.00   0.10    0.1    0.0    0.0
2   0.24   0.24    0.0    0.0    0.0
3   0.00   0.25    0.3    0.0    0.0
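
The tie handling here relies on two behaviours: `Series.nlargest` keeps the first occurrences by default (`keep='first'`), and `Series.eq` aligns on labels, so columns missing from the `nlargest` result compare as False. A small sketch on a single all-equal row (the names `row`, `top` and `keep` are only illustrative):

```python
import pandas as pd

row = pd.Series(
    [0.24, 0.24, 0.24, 0.24, 0.24],
    index=["col_A", "col_B", "col_C", "col_D", "col_E"],
)

# With tied values, nlargest returns the first occurrences (keep='first' is the default)
top = row.nlargest(2)

# eq aligns on the index; labels absent from `top` are compared with NaN and give False
keep = row.eq(top)

print(list(top.index))
print(keep.tolist())
```

This is why the third row of the output keeps col_A and col_B even though all five values are equal.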

For better performance, use NumPy; this solution is from @Divakar:

import numpy as np

N = 2
#https://stackoverflow.com/a/61518029/2901002
# Stable sort on the negated values: descending order, ties keep the lower column index
idx = np.argsort(-df.to_numpy(), kind='mergesort')[:,:N]
mask = np.zeros(df.shape, dtype=bool)
np.put_along_axis(mask, idx, True, axis=-1)
df = df.where(mask, 0)
print(df)
   col_A  col_B  col_C  col_D  col_E
0   0.00   0.00    0.0    0.4    0.5
1   0.00   0.10    0.1    0.0    0.0
2   0.24   0.24    0.0    0.0    0.0
3   0.00   0.25    0.3    0.0    0.0
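
As a self-contained sketch, the NumPy approach could be wrapped in a helper and checked against the sample data from the question. The function name `keep_top_n` is hypothetical, and `kind="stable"` is used here in place of `'mergesort'`; both request a stable sort from `np.argsort`.

```python
import numpy as np
import pandas as pd


def keep_top_n(df: pd.DataFrame, n: int) -> pd.DataFrame:
    """Return a copy of df where only the n largest values per row are kept."""
    values = df.to_numpy()
    # Sorting the negated values ascending gives descending order;
    # a stable sort keeps tied values in their original column order.
    idx = np.argsort(-values, axis=1, kind="stable")[:, :n]
    # Boolean mask that is True at the positions of the top-n values in each row
    mask = np.zeros(values.shape, dtype=bool)
    np.put_along_axis(mask, idx, True, axis=1)
    # Keep the masked values, replace everything else with 0
    return df.where(mask, 0)


df = pd.DataFrame(
    [
        [0.10, 0.10, 0.30, 0.40, 0.50],
        [0.06, 0.10, 0.10, 0.10, 0.01],
        [0.24, 0.24, 0.24, 0.24, 0.24],
        [0.20, 0.25, 0.30, 0.12, 0.02],
    ],
    columns=["col_A", "col_B", "col_C", "col_D", "col_E"],
)

result = keep_top_n(df, 2)
print(result)
```

Negating the values assumes numeric data without NaN; rows containing NaN would need separate handling.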

Solution 1
4 Accepted 2020-04-30 06:20:48