Get the top N values within the inner level of a MultiIndex-based DataFrame

Question

I have a Pandas MultiIndex DataFrame that was converted from an xarray Dataset with three dimensions (time, latitude and longitude) and two variables ("FFDI" and "REF_ID"). Time has 17696 entries (daily from 1972-01-20 to 2020-06-30), latitude has 148 and longitude has 244.

The DataFrame looks like this:

                                    FFDI    REF_ID
latitude    longitude   time        
-39.200001  140.800003  2009-02-07  10.2    0
                        2009-01-30  10.1    0
                        1983-02-12  10.0    0
                        2003-01-13  9.8     0
                        2019-12-28  9.8     0
                        2000-01-17  9.7     0
            ...     ...     ...     ...     ...

-33.900002  150.000000  ... ...     ...     ...
                        1994-06-16  0.9     36111
                        1978-07-07  0.2     36111
                        2020-08-28  0.1     36111
                        2007-06-09  0.0     36111
                        1994-07-30  0.0     36111
                        1987-06-21  0.0     36111
                        
639037952 rows × 2 columns

The DataFrame has already been sorted in descending order on "FFDI". What I want is to get the top N (say 3) "time" rows for each latitude and longitude pair.

So with N = 3 the DataFrame would look like this:

                                    FFDI    REF_ID
latitude    longitude   time        
-39.200001  140.800003  2009-02-07  10.2    0
                        2009-01-30  10.1    0
                        1983-02-12  10.0    0
-39.200001  140.83786   2001-01-03  10.5    0
                        2006-01-18  10.3    0
                        2009-02-07  10.2    0
            ...     ...     ...     ...     ...

-33.900002  150.000000  2009-02-07  10.9    36111
                        2006-01-10  10.7    36111
                        1983-01-23  10.6    36111

Answer 1

Give this a shot:

df.groupby(level=['latitude','longitude'],
           group_keys=False).apply(lambda x: x.nlargest(n=3,columns=['FFDI','REF_ID']))

The group_keys=False is necessary because you're grouping on levels of the MultiIndex: if it is left at True (the default), groupby() redundantly adds those keys to the index of the output.
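To make the effect of group_keys concrete, here is a minimal sketch on made-up data. The toy values, the helper name top3 and the variable names are assumptions for illustration only; they are not from the question.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 1 latitude x 2 longitudes x 5 days (values are made up).
times = pd.date_range("2020-06-01", periods=5)
index = pd.MultiIndex.from_product(
    [[-39.2], [140.8, 150.0], times],
    names=["latitude", "longitude", "time"],
)
df = pd.DataFrame({"FFDI": np.arange(10, dtype=float), "REF_ID": 0}, index=index)


def top3(group):
    # Three largest FFDI rows of one latitude/longitude group.
    return group.nlargest(3, "FFDI")


by_cell = ["latitude", "longitude"]

# group_keys=False: the result keeps the original three index levels.
plain = df.groupby(level=by_cell, group_keys=False).apply(top3)

# group_keys=True: the group keys are prepended to the index as extra levels.
keyed = df.groupby(level=by_cell, group_keys=True).apply(top3)

print(plain.index.names)
print(keyed.index.names)
```

Comparing the two printed lists of level names should show the group keys repeated in the second one, which is the redundancy described above.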

I created a smaller dataset:

import numpy as np, pandas as pd

latitudes = [-39.200001,-39.200001,-39.200002]*10
longitudes = [140.800003,140.83786,150.000000]*10
sequence = [0,1,5,0,1,2,4,50,0,7]
times = pd.date_range(start='2020-06-01',end='2020-06-30')
 
s = pd.Series(
        np.random.randn(len(sequence)*3),
        index=pd.MultiIndex.from_tuples(zip(latitudes,longitudes,times),
                                        names=['latitude','longitude','time'])
    )

df = pd.DataFrame(s,columns=['FFDI'])
df['REF_ID'] = np.random.randint(0,36111,len(sequence) * 3)

Then tested:

In [48]: df.groupby(level=['latitude','longitude'],
                    group_keys=False).apply(lambda x: x.nlargest(n=3,columns=['FFDI','REF_ID']))
Out[48]: 
                                      FFDI  REF_ID
latitude   longitude  time                        
-39.200002 150.000000 2020-06-09  1.658600   32650
                      2020-06-24  1.412439    6124
                      2020-06-06  0.248274   15765
-39.200001 140.800003 2020-06-13  0.906517    6980
                      2020-06-25  0.757745   27483
                      2020-06-04  0.671170   31313
           140.837860 2020-06-20  1.162408   20113
                      2020-06-14  1.014437   34023
                      2020-06-11  0.657841    8366
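Since the question says the DataFrame is already sorted descending on "FFDI", a possible alternative (not part of the tested answer above) is to take the first N rows of each group with groupby().head(), which avoids calling a Python function per group. The sketch below uses made-up data and assumes the sort order really holds; the names points, df_sorted and top3 are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real data: 3 grid points x 10 days.
rng = np.random.default_rng(0)
points = [(-39.200001, 140.800003), (-39.200001, 140.83786), (-33.900002, 150.0)]
times = pd.date_range("2020-06-01", periods=10)
index = pd.MultiIndex.from_tuples(
    [(lat, lon, t) for lat, lon in points for t in times],
    names=["latitude", "longitude", "time"],
)
df = pd.DataFrame(
    {
        "FFDI": rng.permutation(30) / 10.0,  # distinct values, so no ties
        "REF_ID": np.repeat([0, 1, 36111], 10),
    },
    index=index,
)

# Mimic the state described in the question: sorted descending on FFDI.
df_sorted = df.sort_values("FFDI", ascending=False)

# head(3) keeps the first 3 rows of each group in their existing order,
# which are the 3 largest FFDI rows only because of the sort above.
top3 = df_sorted.groupby(level=["latitude", "longitude"], sort=False).head(3)
print(top3.sort_index())
```

If the sort order cannot be relied on, the nlargest version in the answer is the safer choice, since it does not depend on row order.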

Answer 1: 0, accepted, 2021-03-17 03:27:56
