How to create a column with the most recent data in a Pandas DataFrame
Used groupby to select most recent data, want to append a column that returns the date of the data
I originally had a DataFrame that looked like this:
                           industry  population  %of rural land
country       date
Australia     2017-01-01        NaN         NaN             NaN
              2016-01-01  24.327571   18.898304              12
              2015-01-01  25.396251   18.835267              12
              2014-01-01  27.277007   18.834835              13
United States 2017-01-01        NaN         NaN             NaN
              2016-01-01        NaN   19.028231             NaN
              2015-01-01  20.027274   19.212860             NaN
              2014-01-01  20.867359   19.379071             NaN
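For anyone who wants to reproduce the frame above, here is a minimal sketch of how it might be built. The construction is an assumption, since the question only shows the printed output: it supposes a MultiIndex with the levels country and date, a datetime date level, and rows ordered newest first.

```python
import numpy as np
import pandas as pd

# Assumed layout: (country, date) MultiIndex, newest date first within each country
dates = pd.to_datetime(["2017-01-01", "2016-01-01", "2015-01-01", "2014-01-01"])
index = pd.MultiIndex.from_product(
    [["Australia", "United States"], dates], names=["country", "date"]
)

df = pd.DataFrame(
    {
        "industry": [np.nan, 24.327571, 25.396251, 27.277007,
                     np.nan, np.nan, 20.027274, 20.867359],
        "population": [np.nan, 18.898304, 18.835267, 18.834835,
                       np.nan, 19.028231, 19.212860, 19.379071],
        "%of rural land": [np.nan, 12, 12, 13,
                           np.nan, np.nan, np.nan, np.nan],
    },
    index=index,
)
print(df)
```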
I applied the following code, which takes the most recent data for each column for every country, and got the following dataset:
df = df.groupby(level=0).first()
                industry  population  %of rural land
country
Australia      24.327571   18.898304              12
United States  20.027274   19.028231             NaN
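The reason the years can differ within one output row is that first() on a groupby returns, for every column separately, the first non-null value in the group. A small sketch of that behaviour, using only the 2016 and 2015 rows for the United States from the table above (the names small and latest are made up for illustration):

```python
import numpy as np
import pandas as pd

# Two United States rows, newest first; industry is missing for 2016
idx = pd.MultiIndex.from_tuples(
    [("United States", pd.Timestamp("2016-01-01")),
     ("United States", pd.Timestamp("2015-01-01"))],
    names=["country", "date"],
)
small = pd.DataFrame(
    {"industry": [np.nan, 20.027274], "population": [19.028231, 19.212860]},
    index=idx,
)

# first() skips NaN column by column, so the two values come from different years
latest = small.groupby(level=0).first()
print(latest)
```

Here industry is taken from the 2015 row while population is taken from the 2016 row, which is why a single "year" for the row has to be chosen by some rule.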
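Is there a way to add a column that shows the year of the data? And where the year differs between columns for the same country, return the earliest year among the data in the new DataFrame? So for Australia that would be 2016, and for the United States 2015. Ideally, the DataFrame would look like this: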
               year   industry  population  %of rural land
country
Australia      2016  24.327571   18.898304              12
United States  2015  20.027274   19.028231             NaN
I think you need to create a helper Series with dropna that holds the year of the first row without NaN for each country, and then:
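# helper Series: year of the first row without NaN for each country
# (dropna() removes every row that contains a NaN, so this assumes each
# country has at least one complete row)
s = df.dropna().reset_index(level=1)['date'].dt.year.groupby(level=0).first()
df1 = df.groupby(level=0).first()
# map each country in the index to its year and insert it as the first column
df1.insert(0, 'year', df1.rename(s).index)
# alternative
# df1.insert(0, 'year', df1.index.to_series().map(s))
print(df1)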
               year   industry  population
country
Australia      2016  24.327571   18.898304
United States  2015  20.027274   19.028231
Another solution: set the date to NaN in rows that contain NaNs, take the first values per group, and finally get the year with dt.year:
df1 = (df.reset_index(level=1)
         # keep the date only in rows without NaN; other rows get NaT
         .assign(date=lambda x: x['date'].where(df.notnull().all(axis=1).values))
         .groupby(level=0).first()
         .assign(date=lambda x: x['date'].dt.year)
         .rename(columns={'date': 'year'}))
print(df1)
               year   industry  population
country
Australia      2016  24.327571   18.898304
United States  2015  20.027274   19.028231
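The masking step in that chain may be easier to follow in isolation. The sketch below is only an illustration on a cut-down frame (three United States rows with the two columns used in this part of the answer); the names complete, flat and first_year are made up for the example.

```python
import numpy as np
import pandas as pd

dates = pd.to_datetime(["2017-01-01", "2016-01-01", "2015-01-01"])
index = pd.MultiIndex.from_product(
    [["United States"], dates], names=["country", "date"]
)
df = pd.DataFrame(
    {"industry": [np.nan, np.nan, 20.027274],
     "population": [np.nan, 19.028231, 19.212860]},
    index=index,
)

# True only for rows in which no column is NaN
complete = df.notnull().all(axis=1)
# move 'date' out of the index so it becomes an ordinary column
flat = df.reset_index(level=1)
# keep the date in complete rows, replace it with NaT everywhere else
flat["date"] = flat["date"].where(complete.values)
print(flat)

# first() skips NaT, so it lands on the date of the first complete row
first_year = flat.groupby(level=0)["date"].first().dt.year
print(first_year)
```

Because first() ignores missing values, blanking out the date in incomplete rows is enough to make it pick the date of the first complete row.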
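Edit: if a column can be entirely NaN for a country (as %of rural land is for the United States), use a custom function per group: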
def f(x):
    # check NaNs
    m = x.isnull()
    # remove columns that are entirely NaN
    m = m.loc[:, ~m.all()]
    # year of the first row without NaN (index entries are (country, date) tuples)
    m = m[~m.any(axis=1)].index[0][1].year
    return m

s = df.groupby(level=0).apply(f)
print(s)
country
Australia        2016
United States    2015
dtype: int64
df1 = df.groupby(level=0).first()
df1.insert(0, 'year', df1.rename(s).index)
# alternative
# df1.insert(0, 'year', df1.index.to_series().map(s))
print(df1)
               year   industry  population  %of rural land
country
Australia      2016  24.327571   18.898304            12.0
United States  2015  20.027274   19.028231             NaN
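As a possible variation, not part of the answer above, the year could also be computed without apply by recording the year of every non-null cell and taking the earliest year among the values that first() picks. This is a different technique from the function f: it does not require a fully populated row, it only looks at the cells that end up in the result. The frame is rebuilt here under the same assumptions as the question's printout, and the names row_year, cell_year and year are invented for this sketch.

```python
import numpy as np
import pandas as pd

dates = pd.to_datetime(["2017-01-01", "2016-01-01", "2015-01-01", "2014-01-01"])
index = pd.MultiIndex.from_product(
    [["Australia", "United States"], dates], names=["country", "date"]
)
df = pd.DataFrame(
    {
        "industry": [np.nan, 24.327571, 25.396251, 27.277007,
                     np.nan, np.nan, 20.027274, 20.867359],
        "population": [np.nan, 18.898304, 18.835267, 18.834835,
                       np.nan, 19.028231, 19.212860, 19.379071],
        "%of rural land": [np.nan, 12, 12, 13,
                           np.nan, np.nan, np.nan, np.nan],
    },
    index=index,
)

# year of every row, aligned with df's index
row_year = pd.Series(df.index.get_level_values("date").year, index=df.index)
# same shape as df: the row's year where the cell holds data, NaN otherwise
cell_year = df.notna().apply(lambda col: row_year.where(col))
# year of the value first() picks in each column, then the earliest of those
year = cell_year.groupby(level=0).first().min(axis=1).astype("Int64")

df1 = df.groupby(level=0).first()
df1.insert(0, "year", year)
print(df1)
```

On the sample data this gives 2016 for Australia and 2015 for the United States, the same as the function-based approach, but the two can differ on data where no single row is complete.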