Creating new columns from pandas df column names and values
I have a dataframe of the form:
basket fruit_apple fruit_pear fruit_cherry
basket_id_1 5 NaN 6
basket_id_2 NaN 1 NaN
and I want to create two new columns that should look like this:
basket fruit_type number
basket_id_1 apple 5
basket_id_1 pear NaN
basket_id_1 cherry 6
basket_id_2 apple NaN
basket_id_2 pear 1
basket_id_2 cherry NaN
where the contents of fruit_type are determined automatically with colname.split('_')[1].
How can this be done in a concise way?
I think it is faster to change the column names than the values of a new column (because there are far fewer of them), so I suggest using rename in the first step, reshaping with DataFrame.melt, and finally changing the order with DataFrame.sort_values:
df = (df.rename(columns=lambda x: x.split('_')[-1])
        .melt('basket', var_name='fruit_type', value_name='number')
        .sort_values('basket'))
print(df)
basket fruit_type number
0 basket_id_1 apple 5.0
2 basket_id_1 pear NaN
4 basket_id_1 cherry 6.0
1 basket_id_2 apple NaN
3 basket_id_2 pear 1.0
5 basket_id_2 cherry NaN
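For anyone who wants to run this end to end, here is a minimal self-contained sketch. The sample frame is rebuilt from the question. The name long_df and the kind="stable" and ignore_index=True arguments are my own additions, not part of the answer above. I added the stable sort because, as far as I know, the default sort does not promise to keep the apple/pear/cherry order within each basket.

```python
import numpy as np
import pandas as pd

# Sample data rebuilt from the question.
df = pd.DataFrame({
    "basket": ["basket_id_1", "basket_id_2"],
    "fruit_apple": [5, np.nan],
    "fruit_pear": [np.nan, 1],
    "fruit_cherry": [6, np.nan],
})

long_df = (
    # 'basket' has no underscore, so the lambda leaves it unchanged.
    df.rename(columns=lambda c: c.split("_")[-1])
      .melt("basket", var_name="fruit_type", value_name="number")
      # Assumption: a stable sort is wanted so rows keep their melt order
      # (apple, pear, cherry) inside each basket.
      .sort_values("basket", kind="stable", ignore_index=True)
)
print(long_df)
```

Note that split('_')[-1] and the question's split('_')[1] give the same result for these columns, but they would differ for a hypothetical name such as fruit_green_apple ('apple' versus 'green').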
Another idea is to use DataFrame.set_index with DataFrame.stack:
df = (df.set_index('basket')
        .rename(columns=lambda x: x.split('_')[-1])
        .rename_axis('fruit_type', axis=1)
        .stack(dropna=False)
        .reset_index(name='number')
     )
print(df)
basket fruit_type number
0 basket_id_1 apple 5.0
1 basket_id_1 pear NaN
2 basket_id_1 cherry 6.0
3 basket_id_2 apple NaN
4 basket_id_2 pear 1.0
5 basket_id_2 cherry NaN
Transform from wide to long format, create the new column by removing 'fruit_' from the fruit_type column, and sort by the 'basket' column.
Edit: thanks to @jezrael, who pointed out the dangers of stripping (str.strip removes any of the listed characters from both ends of the string rather than the literal prefix, which will affect other use cases). I changed it to replace, which should be safer, since it matches 'fruit_' as a whole pattern.
(df.melt(id_vars='basket',
         var_name='fruit_type',
         value_name='number')
   .assign(fruit_type=lambda x: x.fruit_type.str.replace(r'fruit_', ''))
   .sort_values('basket')
)
basket fruit_type number
0 basket_id_1 apple 5.0
2 basket_id_1 pear NaN
4 basket_id_1 cherry 6.0
1 basket_id_2 apple NaN
3 basket_id_2 pear 1.0
5 basket_id_2 cherry NaN
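To make the strip-versus-replace point from the edit note concrete, here is a small sketch. The names fruit_tangerine and fruit_fig are hypothetical and do not appear in the question. I picked them because the fruit names start with letters that also occur in 'fruit_'. The anchored pattern and the explicit regex=True are my own choices for illustration.

```python
import pandas as pd

# Hypothetical column names, chosen to expose the problem with strip.
names = pd.Series(["fruit_apple", "fruit_tangerine", "fruit_fig"])

# str.strip treats its argument as a set of characters and removes any of
# them from both ends, so it keeps eating into the fruit name itself.
stripped = names.str.strip("fruit_")

# An anchored replace removes only the literal prefix 'fruit_'.
replaced = names.str.replace(r"^fruit_", "", regex=True)

print(stripped.tolist())  # ['apple', 'angerine', 'g']
print(replaced.tolist())  # ['apple', 'tangerine', 'fig']
```

With the columns in the question both approaches happen to give the same result, which is why the problem is easy to miss.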