How to sort each row of a pandas DataFrame and return the column labels ordered by the row's sorted values

Question

I am trying to sort each row of a pandas DataFrame and get the column labels, in sorted order, in a new DataFrame. I can do it, but only in a slow way. Can anyone suggest improvements using parallelization or vectorized code? I have posted an example below.

import pandas as pd

data_url = 'https://raw.githubusercontent.com/resbaz/r-novice-gapminder-files/master/data/gapminder-FiveYearData.csv'

# read data from url as pandas dataframe
gapminder = pd.read_csv(data_url)

# drop categorical column
gapminder.drop(['country', 'continent'], axis=1, inplace=True) 

# print the first three rows
print(gapminder.head(n=3))

   year         pop  lifeExp   gdpPercap
0  1952   8425333.0   28.801  779.445314
1  1957   9240934.0   30.332  820.853030
2  1962  10267083.0   31.997  853.100710

The result I am looking for is this:

  tag_0 tag_1      tag_2    tag_3
0   pop  year  gdpPercap  lifeExp
1   pop  year  gdpPercap  lifeExp
2   pop  year  gdpPercap  lifeExp

In this case, since pop is always larger than year, gdpPercap and lifeExp, it always comes first.

I can get the required output with the following code, but the computation takes a long time if the DataFrame has many rows or columns. Can anyone suggest an improvement over this?

def sort_df(df):
    sorted_tags = pd.DataFrame(index = df.index, columns = ['tag_{}'.format(i) for i in range(df.shape[1])])
    for i in range(df.shape[0]):
        sorted_tags.iloc[i,:] = list( df.iloc[i, :].sort_values(ascending=False).index)
    return sorted_tags

sort_df(gapminder)

Answer 1

This is probably as fast as it gets with numpy:

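import numpy as np

def sort_df(df):
    # Column positions that sort each row in descending order
    order = np.argsort(-df.to_numpy(), axis=1)
    return pd.DataFrame(
        data=df.columns.to_numpy()[order],
        index=df.index,  # keep the original row labels
        columns=['tag_{}'.format(i) for i in range(df.shape[1])]
    )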

print(sort_df(gapminder.head(3)))

  tag_0 tag_1      tag_2    tag_3
0   pop  year  gdpPercap  lifeExp
1   pop  year  gdpPercap  lifeExp
2   pop  year  gdpPercap  lifeExp

Explanation: np.argsort with axis=1 works along each row, but instead of the sorted values it returns the indices that would sort the array, which can then be used to reorder other arrays in the same way. Negating the values gives descending order. In your case, the indices are used to reorder the column labels. Indexing the 1-D array of column labels with the 2-D array of indices (numpy integer-array indexing) returns a result with the shape of the index array, so every row gets its own ordering of the labels.
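To make the indexing step concrete, here is a minimal sketch on a made-up three-row frame (the name toy and its values are assumptions for illustration, not part of the gapminder data). It shows the intermediate index array and the label array it produces.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, chosen so that no row contains ties
toy = pd.DataFrame({"a": [1, 9, 5], "b": [7, 2, 6], "c": [3, 4, 8]})

# Column positions that sort each row in descending order
order = np.argsort(-toy.to_numpy(), axis=1)

# Indexing the 1-D label array with the 2-D position array
# yields a 2-D array of labels with the same shape as `order`
labels = toy.columns.to_numpy()[order]

print(order)
print(labels)
```

For the first row (1, 7, 3) the largest value sits in column b, then c, then a, so the first row of labels reads b, c, a.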

Runtime is around 3 ms for your example vs. 2.5 s with your function.
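If you want to check the speed-up on your own machine, something like the following sketch may help. It uses random data rather than the gapminder file, and the names sort_df_loop, sort_df_vectorized and demo are assumptions introduced here; the loop version is a row-by-row equivalent of the function in the question, not a copy of it. Timings will vary, so none are shown.

```python
import timeit

import numpy as np
import pandas as pd


def sort_df_loop(df):
    # Row-by-row approach: sort each row and collect its column labels
    rows = [
        list(df.iloc[i].sort_values(ascending=False).index)
        for i in range(df.shape[0])
    ]
    cols = ['tag_{}'.format(i) for i in range(df.shape[1])]
    return pd.DataFrame(rows, index=df.index, columns=cols)


def sort_df_vectorized(df):
    # argsort-based approach from the answer
    order = np.argsort(-df.to_numpy(), axis=1)
    cols = ['tag_{}'.format(i) for i in range(df.shape[1])]
    return pd.DataFrame(df.columns.to_numpy()[order], index=df.index, columns=cols)


# Hypothetical random data; continuous values make ties very unlikely
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.random((200, 4)), columns=["w", "x", "y", "z"])

# Both approaches should produce the same labels
same = sort_df_loop(demo).to_numpy().tolist() == sort_df_vectorized(demo).to_numpy().tolist()
print("identical results:", same)

# Rough timings in seconds for 3 runs each
print("loop:      ", timeit.timeit(lambda: sort_df_loop(demo), number=3))
print("vectorized:", timeit.timeit(lambda: sort_df_vectorized(demo), number=3))
```

Note that rows containing equal values could be ordered differently by the two approaches, since the tie-breaking of the sort algorithms is not guaranteed to match.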
