
Used groupby to select most recent data, want to append a column that returns the date of the data

I originally had a dataframe that looked like this:

                                  industry    population %of rural land
        country       date        
        Australia     2017-01-01  NaN         NaN        NaN
                      2016-01-01  24.327571   18.898304  12
                      2015-01-01  25.396251   18.835267  12
                      2014-01-01  27.277007   18.834835  13
        United States 2017-01-01  NaN         NaN        NaN
                      2016-01-01  NaN         19.028231  NaN
                      2015-01-01  20.027274   19.212860  NaN
                      2014-01-01  20.867359   19.379071  NaN

I applied the following code, which pulled the most recent data in each column for each country and produced the following dataset:

df = df.groupby(level=0).first()

               industry  population  %of rural land
country                             
Australia      24.327571   18.898304 12
United States  20.027274   19.028231 NaN
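Note that `first()` returns the first non-null value in each column separately, which is why the United States row mixes values from 2015 (industry) and 2016 (population). A minimal sketch of this, assuming the sample frame is rebuilt by hand from the printout above (the construction code is not from the original post):

```python
import numpy as np
import pandas as pd

# Sample data rebuilt by hand from the printed frame (an assumption about dtypes).
dates = pd.to_datetime(["2017-01-01", "2016-01-01", "2015-01-01", "2014-01-01"])
index = pd.MultiIndex.from_product(
    [["Australia", "United States"], dates], names=["country", "date"]
)
nan = np.nan
df = pd.DataFrame(
    {
        "industry": [nan, 24.327571, 25.396251, 27.277007, nan, nan, 20.027274, 20.867359],
        "population": [nan, 18.898304, 18.835267, 18.834835, nan, 19.028231, 19.212860, 19.379071],
        "%of rural land": [nan, 12, 12, 13, nan, nan, nan, nan],
    },
    index=index,
)

# first() skips NaNs column by column, so values in one output row
# can come from different dates.
latest = df.groupby(level="country").first()
print(latest)
```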

Is there any way to add a column that shows the year of the data as well? And where the year differs between columns for the same country, can the new dataframe return the oldest of those years? For Australia that would be 2016, and for the United States 2015. Ideally, the dataframe would look like this:

               year      industry  population  %of rural land
country                             
Australia      2016      24.327571   18.898304 12
United States  2015      20.027274   19.028231 NaN

I think you need to create a helper Series holding the year of the first row without NaNs for each country, using dropna, and then insert it as a new column:

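# year of the first row without NaNs for each country
s = df.dropna().reset_index(level=1)['date'].dt.year.groupby(level=0).first()
# first non-NaN value per column for each country
df1 = df.groupby(level=0).first()
# map each country in the index to its year and insert it as the first column
df1.insert(0, 'year', df1.rename(s).index)
# alternative
# df1.insert(0, 'year', df1.index.to_series().map(s))
print(df1)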
               year   industry  population
country                                   
Australia      2016  24.327571   18.898304
United States  2015  20.027274   19.028231
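One caveat: the output above contains only the industry and population columns. With the question's third column included, `df.dropna()` removes every United States row, because `%of rural land` is NaN for all of them. A minimal sketch of the effect and of one possible workaround, restricting `dropna` to chosen columns; the frame is rebuilt by hand and the choice of `subset` columns is an assumption for illustration:

```python
import numpy as np
import pandas as pd

# Sample data rebuilt by hand from the question (an assumption about dtypes).
dates = pd.to_datetime(["2017-01-01", "2016-01-01", "2015-01-01", "2014-01-01"])
index = pd.MultiIndex.from_product(
    [["Australia", "United States"], dates], names=["country", "date"]
)
nan = np.nan
df = pd.DataFrame(
    {
        "industry": [nan, 24.327571, 25.396251, 27.277007, nan, nan, 20.027274, 20.867359],
        "population": [nan, 18.898304, 18.835267, 18.834835, nan, 19.028231, 19.212860, 19.379071],
        "%of rural land": [nan, 12, 12, 13, nan, nan, nan, nan],
    },
    index=index,
)

# Plain dropna() keeps only fully populated rows, so the United States vanishes.
remaining = df.dropna().index.get_level_values("country").unique()
print(list(remaining))

# Restricting the NaN check to selected columns keeps both countries.
complete = df.dropna(subset=["industry", "population"])
s = complete.reset_index(level="date")["date"].dt.year.groupby(level="country").first()
print(s)
```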

Another solution is to blank out the date in rows that contain NaNs, take the first value per group, and finally get the year with dt.year:

df1 = (df.reset_index(level=1)
        .assign(date=lambda x: x['date'].where(df.notnull().all(axis=1).values))
        .groupby(level=0).first()
        .assign(date=lambda x: x['date'].dt.year)
        .rename(columns={'date':'year'}))
print (df1)
               year   industry  population
country                                   
Australia      2016  24.327571   18.898304
United States  2015  20.027274   19.028231

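EDIT: If a column holds only NaNs for a country, ignore that column within each group before looking for the first complete row: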

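def f(x):
    # boolean mask of NaNs for this country's rows
    m = x.isnull()
    # drop columns that are NaN in every row of the group
    m = m.loc[:, ~m.all()]
    # year of the first row that has no NaNs in the remaining columns
    # (axis must be passed by keyword in recent pandas versions)
    return m[~m.any(axis=1)].index[0][1].year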

s = df.groupby(level=0).apply(f)
print (s)
country
Australia        2016
United States    2015
dtype: int64

df1 = df.groupby(level=0).first()
df1.insert(0, 'year', df1.rename(s).index)
#alternative
#df1.insert(0, 'year', df1.index.to_series().map(s))
print (df1)
               year   industry  population  %of rural land
country                                                   
Australia      2016  24.327571   18.898304            12.0
United States  2015  20.027274   19.028231             NaN
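As a possible alternative to `apply`, the year can also be derived by reshaping to long format: find the most recent date with data for each country and column, then take the oldest of those dates per country. This is a sketch of a different technique from the answer above, not the answer's method; it happens to give the same result on this sample, and the frame is again rebuilt by hand (an assumption):

```python
import numpy as np
import pandas as pd

# Sample data rebuilt by hand from the question (an assumption about dtypes).
dates = pd.to_datetime(["2017-01-01", "2016-01-01", "2015-01-01", "2014-01-01"])
index = pd.MultiIndex.from_product(
    [["Australia", "United States"], dates], names=["country", "date"]
)
nan = np.nan
df = pd.DataFrame(
    {
        "industry": [nan, 24.327571, 25.396251, 27.277007, nan, nan, 20.027274, 20.867359],
        "population": [nan, 18.898304, 18.835267, 18.834835, nan, 19.028231, 19.212860, 19.379071],
        "%of rural land": [nan, 12, 12, 13, nan, nan, nan, nan],
    },
    index=index,
)

# Long format: one row per (country, date, column), keeping only real values.
long = df.reset_index().melt(id_vars=["country", "date"]).dropna(subset=["value"])

# Most recent date with data for each country/column pair.
latest_date = long.groupby(["country", "variable"])["date"].max()

# Oldest of those dates per country, reduced to the year.
year = latest_date.groupby(level="country").min().dt.year

df1 = df.groupby(level="country").first()
df1.insert(0, "year", year)  # aligned on the country index
print(df1)
```

Because it uses `max()` on the dates rather than row position, this version does not rely on the rows being sorted newest first.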
