Selecting the latest data with groupby and appending a column that returns the date of the data

Question

I originally have a DataFrame that looks like this:

                                  industry    population %of rural land
        country       date        
        Australia     2017-01-01  NaN         NaN        NaN
                      2016-01-01  24.327571   18.898304  12
                      2015-01-01  25.396251   18.835267  12
                      2014-01-01  27.277007   18.834835  13
        United States 2017-01-01  NaN         NaN        NaN
                      2016-01-01  NaN         19.028231  NaN
                      2015-01-01  20.027274   19.212860  NaN
                      2014-01-01  20.867359   19.379071  NaN

I applied the following code, which extracts the most recent available value of each column for every country, and got the following dataset:

df = df.groupby(level=0).first()

               industry  population  %of rural land
country                             
Australia      24.327571   18.898304 12
United States  20.027274   19.028231 NaN

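Is there a way to add a column showing the year of the data? If the years differ between columns for the same country, the new DataFrame should return the earliest of those years. So for Australia this would be 2016, and for the United States it would be 2015. Ideally, the DataFrame would look like this: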

               year      industry  population  %of rural land
country                             
Australia      2016      24.327571   18.898304 12
United States  2015      20.027274   19.028231 NaN
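For anyone who wants to reproduce this, here is a minimal sketch that rebuilds the frame from the values shown above and applies the same groupby(level=0).first() step. The variable names (countries, dates, latest) are only illustrative. Note that first() returns the first non-null entry of each column within a group, which is why the values for one country can come from different years.

```python
import numpy as np
import pandas as pd

# Rebuild the (country, date) MultiIndex from the question.
countries = ["Australia", "United States"]
dates = pd.to_datetime(["2017-01-01", "2016-01-01", "2015-01-01", "2014-01-01"])
index = pd.MultiIndex.from_product([countries, dates], names=["country", "date"])

df = pd.DataFrame(
    {
        "industry": [np.nan, 24.327571, 25.396251, 27.277007,
                     np.nan, np.nan, 20.027274, 20.867359],
        "population": [np.nan, 18.898304, 18.835267, 18.834835,
                       np.nan, 19.028231, 19.212860, 19.379071],
        "%of rural land": [np.nan, 12, 12, 13,
                           np.nan, np.nan, np.nan, np.nan],
    },
    index=index,
)

# first() skips nulls column by column, so each column keeps its own
# most recent non-null value (rows are ordered newest first).
latest = df.groupby(level=0).first()
print(latest)
```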

Answer 1

I think you need dropna to build a helper Series holding the year of the first non-NaN row for each country, then:

s = df.dropna().reset_index(level=1)['date'].dt.year.groupby(level=0).first()
df1 = df.groupby(level=0).first()
df1.insert(0, 'year', df1.rename(s).index)
#alternative
#df1.insert(0, 'year', df1.index.to_series().map(s))
print (df1)
               year   industry  population
country                                   
Australia      2016  24.327571   18.898304
United States  2015  20.027274   19.028231

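Another solution: set date to NaN for rows that contain missing values, take the first value per group, and finally extract the year with dt.year: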

df1 = (df.reset_index(level=1)
        .assign(date=lambda x: x['date'].where(df.notnull().all(axis=1).values))
        .groupby(level=0).first()
        .assign(date=lambda x: x['date'].dt.year)
        .rename(columns={'date':'year'}))
print (df1)
               year   industry  population
country                                   
Australia      2016  24.327571   18.898304
United States  2015  20.027274   19.028231

Edit (this version also handles columns that are entirely NaN for a country):

def f(x):
    # boolean mask of missing values for this country
    m = x.isnull()
    # drop columns that are entirely NaN for this country
    m = m.loc[:, ~m.all()]
    # year of the first row with no NaNs in the remaining columns
    m = m[~m.any(axis=1)].index[0][1].year
    return (m)

s = df.groupby(level=0).apply(f)
print (s)
country
Australia        2016
United States    2015
dtype: int64

df1 = df.groupby(level=0).first()
df1.insert(0, 'year', df1.rename(s).index)
#alternative
#df1.insert(0, 'year', df1.index.to_series().map(s))
print (df1)
               year   industry  population  %of rural land
country                                                   
Australia      2016  24.327571   18.898304            12.0
United States  2015  20.027274   19.028231             NaN
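As a variation, and not the method used in the answer above, the year can also be derived column by column: find the first non-null date of every column with first_valid_index, then take the smallest year. This follows the wording of the question ("the earliest of those years") and gives the same result on this data, but it can differ on other data because it does not require a single row in which all columns are filled. The sketch below is self-contained; earliest_year is a hypothetical helper name, and it assumes every country has at least one non-null value.

```python
import numpy as np
import pandas as pd

# Same frame as in the question, rebuilt so the example runs on its own.
dates = pd.to_datetime(["2017-01-01", "2016-01-01", "2015-01-01", "2014-01-01"])
index = pd.MultiIndex.from_product(
    [["Australia", "United States"], dates], names=["country", "date"]
)
df = pd.DataFrame(
    {
        "industry": [np.nan, 24.327571, 25.396251, 27.277007,
                     np.nan, np.nan, 20.027274, 20.867359],
        "population": [np.nan, 18.898304, 18.835267, 18.834835,
                       np.nan, 19.028231, 19.212860, 19.379071],
        "%of rural land": [np.nan, 12, 12, 13,
                           np.nan, np.nan, np.nan, np.nan],
    },
    index=index,
)


def earliest_year(group):
    """Hypothetical helper: smallest year among each column's first non-null row."""
    # first_valid_index returns the (country, date) label of the first
    # non-null entry, or None when the column is entirely NaN.
    labels = [group[col].first_valid_index() for col in group.columns]
    # Assumption: at least one column has data, otherwise min() raises.
    return min(label[1].year for label in labels if label is not None)


years = df.groupby(level=0).apply(earliest_year)

result = df.groupby(level=0).first()
# Inserting a Series aligns it on the country index.
result.insert(0, "year", years)
print(result)
```

Because insert aligns on the index, no rename or map step is needed here.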

Solution 1
0 2017-12-05 19:16:47