
Python Pandas dataframe: get highest n non-NaN values for each column and the indices for those values

I have a pandas dataframe with values from multiple locations across many days.

import pandas as pd
import numpy as np
df = pd.DataFrame({'day': [1, 2, 3, 4, 5, 6],
                   'location-1': [10, 24, 24, 85, 90, np.nan],
                   'location-2': [np.nan, np.nan, 45, 28, np.nan, np.nan]})
df.set_index('day', inplace=True)

I need to get the 4 highest values at each location, and the days on which they occur. NaN values need to be placed last. Something along the lines of:

result = pd.DataFrame({'location-1': [90, 85, 24, 24],
                       'location-2': [45, 28, np.nan, np.nan]})
result_days = pd.DataFrame({'location-1': [5, 4, 3, 2],
                            'location-2': [3, 4, 6, 5]})

I found a partial solution here: "Get top 4 biggest values from each column using Pandas in Python".

But that solution sorts NaN to the top, and I can't find an na_position option for it. I saw solutions that then manually cycle each column's NaN values down to the bottom, but I have no prior knowledge of which columns contain NaN, and I also have to keep track of the days. I can't use dropna because one location may have important values on a day when another location has NaN.

My questions are:

  1. How do I sort this efficiently and extract the highest non-NaN values? I can hack it and replace NaN with -999 prior to sorting, but I'd like to see if a general solution exists that doesn't rely on the assumption that my numbers are above a certain value.
  2. How do I efficiently pull out the days (or row indices) for the values in question 1? There may be repeated high values (as in location-1), in which case the latest day should come first. I have seen some solutions with np.argsort and np.argpartition, but I think they may hinge on how NaN values are dealt with here.

This is my first time asking a question, and I am happy to clarify/change anything. Apologies if this is a duplicate; I hadn't found ones with the same situation. Thanks!

The following loop will give you what you need. It calls sort_values on each location column in turn and assigns the top rows to result and their index labels to result_days. By default sort_values places NaN last, even when sorting in descending order.

cols = ['location-1', 'location-2']
result = pd.DataFrame(columns=cols)
result_days  = pd.DataFrame(columns=cols)

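for c in cols:
    # Sort the whole frame by this column, largest first (NaN goes last by default)
    tmp = df.sort_values(c, ascending=False).head(4)
    result[c] = tmp[c].values          # the top 4 values
    result_days[c] = tmp.index.values  # the days (index labels) they occur on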

print(result)
print(result_days)

   location-1  location-2
0        90.0        45.0
1        85.0        28.0
2        24.0         NaN
3        24.0         NaN
   location-1  location-2
0           5           3
1           4           4
2           2           1
3           3           2
