Row-wise operation in Pandas data frame
I have a world indicators dataset in this format:
country year indicatorName value
USA 1970 Agricultural Land ...
USA 1970 Crop production ...
...
USA 2000 Agricultural Land ...
USA 2000 Crop production ...
...
Mexico 1970 Agricultural Land ...
Mexico 1970 Crop production ...
...
Mexico 2000 Agricultural Land ...
Mexico 2000 Crop production ...
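There are other indicators that I have not included here, but these two are the ones I am interested in. I want to divide the value of Crop production by the value of Agricultural Land for each country and each year. Let's name the result crop_prod_density.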
I don't know how to proceed from
df.groupby(['country', 'year'])
How do I perform the operation from there to produce one of the following outputs?
country year indicatorName value
USA 1970 Agricultural Land ...
USA 1970 Crop production ...
USA 1970 crop_prod_density ...

country year indicatorName value crop_prod_density
USA 1970 Agricultural Land ... us_value_1970
USA 1970 Crop production ... us_value_1970
...
Mexico 2000 Agricultural Land ... mx_value_2000
Mexico 2000 Crop production ... mx_value_2000

country year crop_prod_density
USA 1970 us_value_1970
...
USA 2000 us_value_2000
...
Mexico 1970 mx_value_1970
...
Mexico 2000 mx_value_2000
You can reshape first with set_index and unstack, and then divide with div:
print (df)
country year indicatorName value
0 USA 1970 Agricultural Land 10
1 USA 1970 Crop production 2
2 USA 2000 Agricultural Land 10
3 USA 2000 Crop production 3
4 Mexico 1970 Agricultural Land 10
5 Mexico 1970 Crop production 5
6 Mexico 2000 Agricultural Land 10
7 Mexico 2000 Crop production 4
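To reproduce the steps, the sample frame printed above can be built by hand. This is only a sketch of the toy data shown in the answer; the values are the illustrative ones from the printout, not real indicator figures.

```python
import pandas as pd

# Toy data matching the printed sample: two indicators per (country, year).
df = pd.DataFrame({
    "country": ["USA"] * 4 + ["Mexico"] * 4,
    "year": [1970, 1970, 2000, 2000] * 2,
    "indicatorName": ["Agricultural Land", "Crop production"] * 4,
    "value": [10, 2, 10, 3, 10, 5, 10, 4],
})
print(df)
```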
df = (df.set_index(['country','year','indicatorName'])['value']
        .unstack()
        .assign(crop_prod_density=lambda x: x['Crop production'].div(x['Agricultural Land'])))
print (df)
indicatorName  Agricultural Land  Crop production  crop_prod_density
country year
Mexico  1970                  10                5                0.5
        2000                  10                4                0.4
USA     1970                  10                2                0.2
        2000                  10                3                0.3
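One caveat with the reshape above: unstack needs every (country, year, indicatorName) combination to be unique, and raises a ValueError otherwise. If a real dataset contains repeated rows, pivot_table with an aggregation function is one possible substitute. The sketch below assumes that averaging duplicates is acceptable, which may not hold for your data; the names long_df and wide and the duplicated row are made up for illustration.

```python
import pandas as pd

# Hypothetical data with a duplicated (USA, 1970, Crop production) entry.
long_df = pd.DataFrame({
    "country": ["USA", "USA", "USA", "Mexico", "Mexico"],
    "year": [1970, 1970, 1970, 1970, 1970],
    "indicatorName": ["Agricultural Land", "Crop production", "Crop production",
                      "Agricultural Land", "Crop production"],
    "value": [10, 1, 3, 10, 5],
})

# unstack cannot reshape an index with duplicate entries.
try:
    long_df.set_index(["country", "year", "indicatorName"])["value"].unstack()
    unstack_failed = False
except ValueError:
    unstack_failed = True

# pivot_table aggregates duplicates (here: mean) before reshaping.
wide = long_df.pivot_table(index=["country", "year"],
                           columns="indicatorName",
                           values="value",
                           aggfunc="mean")
wide["crop_prod_density"] = wide["Crop production"] / wide["Agricultural Land"]
print(wide)
```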
Then reshape back to the long format with stack:
df1 = df.stack().reset_index(name='value')
print (df1)
country year indicatorName value
0 Mexico 1970 Agricultural Land 10.0
1 Mexico 1970 Crop production 5.0
2 Mexico 1970 crop_prod_density 0.5
3 Mexico 2000 Agricultural Land 10.0
4 Mexico 2000 Crop production 4.0
5 Mexico 2000 crop_prod_density 0.4
6 USA 1970 Agricultural Land 10.0
7 USA 1970 Crop production 2.0
8 USA 1970 crop_prod_density 0.2
9 USA 2000 Agricultural Land 10.0
10 USA 2000 Crop production 3.0
11 USA 2000 crop_prod_density 0.3
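To add the new column next to the original columns instead, move crop_prod_density into the index before stacking; at the end, reindex is needed to change the order of the columns: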
df2 = (df.set_index(['crop_prod_density'], append=True)
         .stack()
         .reset_index(name='value')
         .reindex(columns=['country','year','indicatorName','value','crop_prod_density']))
print (df2)
country year indicatorName value crop_prod_density
0 Mexico 1970 Agricultural Land 10 0.5
1 Mexico 1970 Crop production 5 0.5
2 Mexico 2000 Agricultural Land 10 0.4
3 Mexico 2000 Crop production 4 0.4
4 USA 1970 Agricultural Land 10 0.2
5 USA 1970 Crop production 2 0.2
6 USA 2000 Agricultural Land 10 0.3
7 USA 2000 Crop production 3 0.3
Finally, drop the columns that are no longer needed and create columns from the MultiIndex:
df3 = (df.drop(['Crop production','Agricultural Land'], axis=1)
         .reset_index()
         # axis is passed by keyword; pandas 2.x no longer accepts it positionally
         .rename_axis(None, axis=1))
print (df3)
country year crop_prod_density
0 Mexico 1970 0.5
1 Mexico 2000 0.4
2 USA 1970 0.2
3 USA 2000 0.3
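Since the question started from df.groupby(['country', 'year']), the same ratio can also be computed per group. This is a minimal sketch rather than the approach used in the answer above; it assumes each group holds exactly one row per indicator, and the names df_long, crop_density and density are made up for the example. On large frames it is likely slower than the unstack approach.

```python
import pandas as pd

df_long = pd.DataFrame({
    "country": ["USA"] * 4 + ["Mexico"] * 4,
    "year": [1970, 1970, 2000, 2000] * 2,
    "indicatorName": ["Agricultural Land", "Crop production"] * 4,
    "value": [10, 2, 10, 3, 10, 5, 10, 4],
})

def crop_density(group: pd.DataFrame) -> float:
    # Look up both indicators by name inside one (country, year) group.
    values = group.set_index("indicatorName")["value"]
    return values["Crop production"] / values["Agricultural Land"]

density = (
    df_long.groupby(["country", "year"])[["indicatorName", "value"]]
    .apply(crop_density)            # one scalar per group -> Series
    .rename("crop_prod_density")
    .reset_index()
)
print(density)
```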