Python Pandas dataframe: get highest n non-NaN values for each column and the indices for those values
I have a pandas dataframe with values from multiple locations across many days.
import pandas as pd
import numpy as np
# np.nan is used because the np.NaN alias was removed in NumPy 2.0
df = pd.DataFrame({'day': [1, 2, 3, 4, 5, 6],
                   'location-1': [10, 24, 24, 85, 90, np.nan],
                   'location-2': [np.nan, np.nan, 45, 28, np.nan, np.nan]})
df.set_index('day', inplace=True)
I need to get the 4 highest values at each location, and the days on which they occur. NaN values need to be placed last. Something along the lines of:
result = pd.DataFrame({'location-1': [90, 85, 24, 24],
                       'location-2': [45, 28, np.nan, np.nan]})
result_days = pd.DataFrame({'location-1': [5, 4, 3, 2],
                            'location-2': [3, 4, 6, 5]})
I found a partial solution here: Get top 4 biggest values from each column using Pandas in Python
But that solution sorts NaN to the top and I can't find an na_position option for this. I saw solutions that then manually cycle each column's NaN down to the bottom, but I have no prior knowledge of which columns contain NaN and also have to keep track of the days. I can't use dropna because one location may have important values on the day that another location has NaN.
My questions are:
This is my first time asking a question, and I am happy to clarify/change anything. Apologies if this is a duplicate; I hadn't found ones with the same situation. Thanks!
The following loop will give you what you need. You sort_values each location and assign it to the proper result and result_days. sort_values places NaN last by default (na_position='last'), even with ascending=False, so the NaN rows end up at the bottom without any extra handling.
cols = ['location-1', 'location-2']
result = pd.DataFrame(columns=cols)
result_days = pd.DataFrame(columns=cols)
for c in cols:
    # Sort the whole frame by this location, highest first; NaN rows sort last
    tmp = df.sort_values(c, ascending=False).head(4)
    result[c] = tmp[c].values          # the 4 highest values
    result_days[c] = tmp.index.values  # the days on which they occur
print(result)
print(result_days)
   location-1  location-2
0        90.0        45.0
1        85.0        28.0
2        24.0         NaN
3        24.0         NaN
   location-1  location-2
0           5           3
1           4           4
2           2           1
3           3           2
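If you need this for an arbitrary number of columns or a different n, the same idea can be wrapped in a small helper. This is only a sketch: the function name top_n_per_column is made up for illustration and is not part of pandas. It sorts each column on its own with Series.sort_values and passes na_position='last' explicitly, so the intent is visible even though that is already the default.

```python
import numpy as np
import pandas as pd


def top_n_per_column(df, n):
    """Hypothetical helper: n largest entries per column, plus their index labels.

    NaN entries are kept but pushed to the bottom of each column.
    """
    values = {}
    labels = {}
    for col in df.columns:
        # Sort this column alone, highest first, NaN after all real values
        top = df[col].sort_values(ascending=False, na_position='last').head(n)
        values[col] = top.to_numpy()
        labels[col] = top.index.to_numpy()
    return pd.DataFrame(values), pd.DataFrame(labels)


# Same sample data as in the question
df = pd.DataFrame({'day': [1, 2, 3, 4, 5, 6],
                   'location-1': [10, 24, 24, 85, 90, np.nan],
                   'location-2': [np.nan, np.nan, 45, 28, np.nan, np.nan]})
df = df.set_index('day')

result, result_days = top_n_per_column(df, 4)
print(result)
print(result_days)
```

Tied values (the two 24s) and the NaN rows may come out in either order relative to each other, so the days reported for them should not be relied on beyond "one of the tied days".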