Used groupby to select most recent data, want to append a column that returns the date of the data
I originally had a dataframe that looked like this:
                           industry  population  %of rural land
country       date
Australia     2017-01-01        NaN         NaN             NaN
              2016-01-01  24.327571   18.898304              12
              2015-01-01  25.396251   18.835267              12
              2014-01-01  27.277007   18.834835              13
United States 2017-01-01        NaN         NaN             NaN
              2016-01-01        NaN   19.028231             NaN
              2015-01-01  20.027274   19.212860             NaN
              2014-01-01  20.867359   19.379071             NaN
I applied the following code, which pulled the most recent data for each column for each country and produced the following dataset:
df = df.groupby(level=0).first()
                industry  population  %of rural land
country
Australia      24.327571   18.898304              12
United States  20.027274   19.028231             NaN
Is there any way to add a column that shows the year of the data as well? And where the year differs between columns for the same country, can the new dataframe return the oldest of those years? So for Australia that would be 2016, and for the US that would be 2015. Ideally, the dataframe would look like this:
               year   industry  population  %of rural land
country
Australia      2016  24.327571   18.898304              12
United States  2015  20.027274   19.028231             NaN
I think you need a helper Series holding, for each country, the year of the first row that contains no NaNs. Create it with dropna, and then:
s = df.dropna().reset_index(level=1)['date'].dt.year.groupby(level=0).first()
df1 = df.groupby(level=0).first()
df1.insert(0, 'year', df1.rename(s).index)
#alternative
#df1.insert(0, 'year', df1.index.to_series().map(s))
print (df1)
               year   industry  population
country
Australia      2016  24.327571   18.898304
United States  2015  20.027274   19.028231
Another solution sets the date column to NaN for every row that contains NaNs, takes the first value per country, and finally gets the year with dt.year:
df1 = (df.reset_index(level=1)
         # keep the date only where the whole row is non-null
         .assign(date=lambda x: x['date'].where(df.notnull().all(axis=1).values))
         .groupby(level=0).first()
         .assign(date=lambda x: x['date'].dt.year)
         .rename(columns={'date':'year'}))
print (df1)
               year   industry  population
country
Australia      2016  24.327571   18.898304
United States  2015  20.027274   19.028231
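EDIT: If a column is entirely NaN for a country (as %of rural land is for the United States), dropna removes every row for that country, so drop the all-NaN columns within each group first: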
def f(x):
    # flag missing values
    m = x.isnull()
    # ignore columns that are entirely NaN within this group
    m = m.loc[:, ~m.all()]
    # year of the first row with no NaNs in the remaining columns
    m = m[~m.any(axis=1)].index[0][1].year
    return m
s = df.groupby(level=0).apply(f)
print (s)
country
Australia        2016
United States    2015
dtype: int64
df1 = df.groupby(level=0).first()
df1.insert(0, 'year', df1.rename(s).index)
#alternative
#df1.insert(0, 'year', df1.index.to_series().map(s))
print (df1)
               year   industry  population  %of rural land
country
Australia      2016  24.327571   18.898304            12.0
United States  2015  20.027274   19.028231             NaN
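For anyone who wants to run this end to end, here is a minimal self-contained sketch. It is not the code from the answer above but a variant of the same idea: for each country it looks up the date of the most recent non-null value in every column, then keeps the oldest of those dates. The frame is rebuilt by hand from the question's sample, and the helper name oldest_source_year is made up for illustration. It assumes the dates are sorted newest first within each country and that every country has at least one non-null value.

```python
import numpy as np
import pandas as pd

# Rebuild the sample frame from the question (dates sorted newest first).
idx = pd.MultiIndex.from_product(
    [["Australia", "United States"],
     pd.to_datetime(["2017-01-01", "2016-01-01", "2015-01-01", "2014-01-01"])],
    names=["country", "date"],
)
df = pd.DataFrame(
    {
        "industry": [np.nan, 24.327571, 25.396251, 27.277007,
                     np.nan, np.nan, 20.027274, 20.867359],
        "population": [np.nan, 18.898304, 18.835267, 18.834835,
                       np.nan, 19.028231, 19.212860, 19.379071],
        "%of rural land": [np.nan, 12, 12, 13,
                           np.nan, np.nan, np.nan, np.nan],
    },
    index=idx,
)


def oldest_source_year(group):
    # Hypothetical helper: date of the first non-null value per column,
    # skipping columns that are entirely NaN for this country.
    dates = [
        col.first_valid_index()[1]  # index entries are (country, date) tuples
        for _, col in group.items()
        if col.notna().any()
    ]
    # The oldest of those dates tells how stale the combined row is.
    return min(dates).year


years = df.groupby(level=0).apply(oldest_source_year)

# First non-null value per column and country, as in the question.
result = df.groupby(level=0).first()
result.insert(0, "year", years)  # aligned on the country index
print(result)
```

Unlike a row-wise dropna, this never requires a single row to be complete across all columns, so it also covers the case where the most recent values of different columns come from different years.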