How to sort each row of a pandas DataFrame and return the column index based on the sorted values of the row
I am trying to sort each row of a pandas DataFrame and get the column labels in sorted order in a new DataFrame. I can do it, but only in a slow way. Can anyone suggest improvements using parallelization or vectorized code? I have posted an example below.
import pandas as pd

data_url = 'https://raw.githubusercontent.com/resbaz/r-novice-gapminder-files/master/data/gapminder-FiveYearData.csv'
# read data from url as pandas dataframe
gapminder = pd.read_csv(data_url)
# drop categorical column
gapminder.drop(['country', 'continent'], axis=1, inplace=True)
# print the first three rows
print(gapminder.head(n=3))
   year         pop  lifeExp   gdpPercap
0  1952   8425333.0   28.801  779.445314
1  1957   9240934.0   30.332  820.853030
2  1962  10267083.0   31.997  853.100710
The result I am looking for is this:
  tag_0 tag_1      tag_2    tag_3
0   pop  year  gdpPercap  lifeExp
1   pop  year  gdpPercap  lifeExp
2   pop  year  gdpPercap  lifeExp
In this case, since pop is always higher than gdpPercap and lifeExp, it always comes first.
I can get the required output with the following code, but the computation takes a long time if the df has a lot of rows or columns.
Can anyone suggest an improvement over this?
def sort_df(df):
    sorted_tags = pd.DataFrame(index=df.index, columns=['tag_{}'.format(i) for i in range(df.shape[1])])
    for i in range(df.shape[0]):
        # sort one row in descending order and keep its column labels
        sorted_tags.iloc[i, :] = list(df.iloc[i, :].sort_values(ascending=False).index)
    return sorted_tags

sort_df(gapminder)
This is probably as fast as it gets with numpy:
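import numpy as np

def sort_df(df):
    return pd.DataFrame(
        # column labels reordered per row, largest value first
        data=df.columns.values[np.argsort(-df.values, axis=1)],
        index=df.index,  # keep the original row labels, as the loop version does
        columns=['tag_{}'.format(i) for i in range(df.shape[1])]
    )

print(sort_df(gapminder.head(3)))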
  tag_0 tag_1      tag_2    tag_3
0   pop  year  gdpPercap  lifeExp
1   pop  year  gdpPercap  lifeExp
2   pop  year  gdpPercap  lifeExp
Explanation: np.argsort sorts the values along each row but returns the indices that would sort the array rather than the sorted values, so the indices can be used to co-sort other arrays. The minus sign gives descending order. In your case, the indices are used to reorder the column labels. Indexing the 1-D array of column names with the 2-D index array returns a result with the same shape as the index array.
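To make the argsort step concrete, here is a minimal sketch on toy data. The column names a, b, c and the values are made up for illustration and are not part of the gapminder example.

```python
import numpy as np

# Hypothetical toy data: two rows, three columns labelled a, b, c.
columns = np.array(["a", "b", "c"])
values = np.array([
    [10.0, 30.0, 20.0],
    [3.0, 1.0, 2.0],
])

# argsort on the negated values gives, for each row, the column
# positions ordered from the largest value to the smallest.
order = np.argsort(-values, axis=1)

# Indexing the 1-D label array with the 2-D integer array returns
# an array of labels with the same shape as `order`.
labels = columns[order]

print(order)   # [[1 2 0], [0 2 1]]
print(labels)  # [['b' 'c' 'a'], ['a' 'c' 'b']]
```

Note that negating the values only works for numeric data; for other dtypes you could argsort in ascending order and reverse each row with `[:, ::-1]` instead.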
Runtime is around 3 ms for your example vs. 2.5 s with your function.
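If you want to check this on your own machine, something like the sketch below should work. It uses a synthetic random DataFrame rather than the gapminder data, the names sort_df_loop and sort_df_vectorized are mine, and the timings will differ from the numbers above depending on data size and hardware. It also checks that both versions return the same labels, which holds here because random floats have no ties within a row.

```python
import timeit

import numpy as np
import pandas as pd


def sort_df_loop(df):
    # Row-by-row version, as in the question.
    tags = ['tag_{}'.format(i) for i in range(df.shape[1])]
    out = pd.DataFrame(index=df.index, columns=tags)
    for i in range(df.shape[0]):
        out.iloc[i, :] = list(df.iloc[i, :].sort_values(ascending=False).index)
    return out


def sort_df_vectorized(df):
    # argsort version; to_numpy() is used here in place of .values
    # so that plain NumPy arrays are indexed.
    tags = ['tag_{}'.format(i) for i in range(df.shape[1])]
    order = np.argsort(-df.to_numpy(), axis=1)
    return pd.DataFrame(df.columns.to_numpy()[order], index=df.index, columns=tags)


# Synthetic data (an assumption for this sketch): 200 rows, 4 numeric columns.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((200, 4)), columns=list("abcd"))

loop_result = sort_df_loop(df)
vec_result = sort_df_vectorized(df)

# Both approaches should produce identical labels when there are no ties.
same = np.array_equal(
    loop_result.to_numpy().astype(str),
    vec_result.to_numpy().astype(str),
)
print("identical results:", same)

# Rough timings in seconds per call.
print("loop:      ", timeit.timeit(lambda: sort_df_loop(df), number=1))
print("vectorized:", timeit.timeit(lambda: sort_df_vectorized(df), number=10) / 10)
```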