Using groupby to select the most recent data, and appending a column that returns the date of the data

Question

I originally had a dataframe that looked like this:

                                  industry    population %of rural land
        country       date        
        Australia     2017-01-01  NaN         NaN        NaN
                      2016-01-01  24.327571   18.898304  12
                      2015-01-01  25.396251   18.835267  12
                      2014-01-01  27.277007   18.834835  13
        United States 2017-01-01  NaN         NaN        NaN
                      2016-01-01  NaN         19.028231  NaN
                      2015-01-01  20.027274   19.212860  NaN
                      2014-01-01  20.867359   19.379071  NaN
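
For anyone who wants to reproduce the problem, the frame above can be rebuilt roughly as follows. This is a sketch reconstructed from the printed output, so the construction method and the resulting dtypes are assumptions rather than the original setup.

```python
import numpy as np
import pandas as pd

# Two-level index (country, date), with dates sorted newest first as printed above.
countries = ["Australia", "United States"]
dates = pd.to_datetime(["2017-01-01", "2016-01-01", "2015-01-01", "2014-01-01"])
index = pd.MultiIndex.from_product([countries, dates], names=["country", "date"])

df = pd.DataFrame(
    {
        "industry": [np.nan, 24.327571, 25.396251, 27.277007,
                     np.nan, np.nan, 20.027274, 20.867359],
        "population": [np.nan, 18.898304, 18.835267, 18.834835,
                       np.nan, 19.028231, 19.212860, 19.379071],
        "%of rural land": [np.nan, 12, 12, 13,
                           np.nan, np.nan, np.nan, np.nan],
    },
    index=index,
)
print(df)
```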

I then applied the following code, which pulled the most recent non-null value of each column for each country and produced this dataset:

df = df.groupby(level=0).first()

               industry  population  %of rural land
country                             
Australia      24.327571   18.898304 12
United States  20.027274   19.028231 NaN
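
Note that first() on a groupby takes the first non-null entry of each column separately, so a single output row can combine values from different dates. A small sketch using only the United States rows (the name demo and the reduced set of columns are just for illustration):

```python
import numpy as np
import pandas as pd

idx = pd.MultiIndex.from_product(
    [["United States"],
     pd.to_datetime(["2017-01-01", "2016-01-01", "2015-01-01"])],
    names=["country", "date"],
)
demo = pd.DataFrame(
    {"industry": [np.nan, np.nan, 20.027274],
     "population": [np.nan, 19.028231, 19.212860]},
    index=idx,
)

# first() skips NaN column by column
latest = demo.groupby(level=0).first()
print(latest)

# date of the first non-null entry in each column; the index label is a
# (country, date) tuple, so element 1 is the date
source_dates = {col: demo[col].first_valid_index()[1] for col in demo.columns}
print(source_dates)
```

Here industry is taken from the 2015 row while population is taken from the 2016 row, which is why the year is ambiguous.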

Is there any way to add a column that shows the year of the data as well? And, where the year differs between columns for the same country, to return the oldest of those years in the new dataframe? So for Australia that would be 2016, and for the US it would be 2015. Ideally, the dataframe would look like this:

               year      industry  population  %of rural land
country                             
Australia      2016      24.327571   18.898304 12
United States  2015      20.027274   19.028231 NaN

Answer 1

I think you need a helper Series that holds, for each country, the year of the first row without any NaNs. Create it with dropna, then insert it as a new column (the output below covers only the industry and population columns):

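# year of the first (most recent) row without NaNs, per country
s = df.dropna().reset_index(level=1)['date'].dt.year.groupby(level=0).first()
# most recent non-null value of each column, per country
df1 = df.groupby(level=0).first()
# map each country in the index to its year and insert that as the first column
df1.insert(0, 'year', df1.rename(s).index)
# alternative
# df1.insert(0, 'year', df1.index.to_series().map(s))
print(df1)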
               year   industry  population
country                                   
Australia      2016  24.327571   18.898304
United States  2015  20.027274   19.028231

Another solution: blank out the date in every row that contains a NaN, take the first non-null value per group, and finally extract the year with dt.year:

df1 = (df.reset_index(level=1)
        # keep the date only for rows where every column has a value
        .assign(date=lambda x: x['date'].where(df.notnull().all(axis=1).values))
        .groupby(level=0).first()
        .assign(date=lambda x: x['date'].dt.year)
        .rename(columns={'date':'year'}))
print(df1)
               year   industry  population
country                                   
Australia      2016  24.327571   18.898304
United States  2015  20.027274   19.028231

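EDIT: To also handle a column that is entirely NaN for a country (such as %of rural land for the United States, where dropna would otherwise remove every row), drop the all-NaN columns within each group before looking for the first complete row: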

def f(x):
    # boolean mask of missing values
    m = x.isnull()
    # drop columns that are entirely NaN within this group
    m = m.loc[:, ~m.all()]
    # year of the first row with no NaNs; the index label is (country, date)
    m = m[~m.any(axis=1)].index[0][1].year
    return m

s = df.groupby(level=0).apply(f)
print(s)
country
Australia        2016
United States    2015
dtype: int64

df1 = df.groupby(level=0).first()
# map each country in the index to its year and insert that as the first column
df1.insert(0, 'year', df1.rename(s).index)
# alternative
# df1.insert(0, 'year', df1.index.to_series().map(s))
print(df1)
               year   industry  population  %of rural land
country                                                   
Australia      2016  24.327571   18.898304            12.0
United States  2015  20.027274   19.028231             NaN
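
The approaches above look for the first row in which every remaining column is filled. A variant that follows the wording of the question more literally is to find, for each column, the most recent date that has data and then take the oldest of those dates per country. The following is only a sketch under that reading: the sample frame is reconstructed from the question, and the names long, latest_date, year and result are illustrative.

```python
import numpy as np
import pandas as pd

# Sample data reconstructed from the question (rows sorted newest first).
index = pd.MultiIndex.from_product(
    [["Australia", "United States"],
     pd.to_datetime(["2017-01-01", "2016-01-01", "2015-01-01", "2014-01-01"])],
    names=["country", "date"],
)
df = pd.DataFrame(
    {
        "industry": [np.nan, 24.327571, 25.396251, 27.277007,
                     np.nan, np.nan, 20.027274, 20.867359],
        "population": [np.nan, 18.898304, 18.835267, 18.834835,
                       np.nan, 19.028231, 19.212860, 19.379071],
        "%of rural land": [np.nan, 12, 12, 13, np.nan, np.nan, np.nan, np.nan],
    },
    index=index,
)

# Long format: one row per (country, date, column), keeping only real values.
long = (df.reset_index()
          .melt(id_vars=["country", "date"], var_name="column")
          .dropna(subset=["value"]))

# Most recent date with data, for each column of each country.
latest_date = long.groupby(["country", "column"])["date"].max()

# Oldest of those dates per country, reduced to its year.
year = latest_date.groupby(level="country").min().dt.year

result = df.groupby(level=0).first()
result.insert(0, "year", year)  # the Series is aligned on the country index
print(result)
```

Because it uses max and min on the dates rather than row position, this variant does not depend on the rows being sorted newest first, and a column with no data at all for a country is simply ignored.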
