
Creating new columns from pandas df column names and values

I have a dataframe of the form:

basket            fruit_apple   fruit_pear   fruit_cherry
basket_id_1       5             NaN          6
basket_id_2       NaN           1            NaN

and I want to create two new columns, so that the result looks like this:

basket          fruit_type   number
basket_id_1     apple        5
basket_id_1     pear         NaN
basket_id_1     cherry       6
basket_id_2     apple        NaN
basket_id_2     pear         1
basket_id_2     cherry       NaN

where the contents of fruit_type are determined automatically with colname.split('_')[1]. How can this be done in a concise way?

I think it is faster to change the column names than to process a new column afterwards (there are fewer values to touch), so I suggest using rename as the first step, reshaping with DataFrame.melt, and finally changing the row order with DataFrame.sort_values:

df = (df.rename(columns=lambda x: x.split('_')[-1])   # 'fruit_apple' -> 'apple'; 'basket' stays 'basket'
        .melt('basket', var_name='fruit_type', value_name='number')   # wide -> long
        .sort_values('basket'))   # put each basket's rows together
print(df)
        basket fruit_type  number
0  basket_id_1      apple     5.0
2  basket_id_1       pear     NaN
4  basket_id_1     cherry     6.0
1  basket_id_2      apple     NaN
3  basket_id_2       pear     1.0
5  basket_id_2     cherry     NaN
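
For anyone who wants to run this end to end, here is a minimal self-contained sketch. The frame is rebuilt from the table in the question; the name long_df and the kind='stable' argument are my additions, not part of the answer above. sort_values uses quicksort by default, which is not guaranteed to keep the apple/pear/cherry order within each basket, so a stable sort is requested explicitly.

```python
import numpy as np
import pandas as pd

# Example data copied from the question's table
df = pd.DataFrame({
    "basket": ["basket_id_1", "basket_id_2"],
    "fruit_apple": [5, np.nan],
    "fruit_pear": [np.nan, 1],
    "fruit_cherry": [6, np.nan],
})

long_df = (
    df.rename(columns=lambda c: c.split("_")[-1])   # 'fruit_apple' -> 'apple'
      .melt("basket", var_name="fruit_type", value_name="number")
      .sort_values("basket", kind="stable")         # stable: keeps melt order within a basket
)

print(long_df)
```

If the original index is not needed, appending .reset_index(drop=True) gives a clean 0..5 index.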

Another idea is to use DataFrame.set_index with DataFrame.stack:

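df = (df.set_index('basket')                          # move basket into the index
        .rename(columns=lambda x: x.split('_')[-1])   # 'fruit_apple' -> 'apple'
        .rename_axis('fruit_type', axis=1)            # name the column axis; it becomes a column after stack
        .stack(dropna=False)                          # dropna=False keeps the NaN rows
        .reset_index(name='number')
        )
print(df)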
        basket fruit_type  number
0  basket_id_1      apple     5.0
1  basket_id_1       pear     NaN
2  basket_id_1     cherry     6.0
3  basket_id_2      apple     NaN
4  basket_id_2       pear     1.0
5  basket_id_2     cherry     NaN

Transform from wide to long format, fill the fruit_type column by removing 'fruit_' from the melted column names, and sort by the 'basket' column.

Edit: thanks to @jezrael, who pointed out the danger of stripping: str.strip removes any of the listed characters from both ends of the string rather than the prefix as a whole, which will break on other column names. I changed it to replace, which should be safer, since it matches 'fruit_' as a pattern (it works on regex and grouping).

(df.melt(id_vars='basket',
         var_name='fruit_type',
         value_name='number')
   .assign(fruit_type=lambda x: x.fruit_type.str.replace(r'fruit_',''))
   .sort_values('basket')
  )

        basket fruit_type  number
0  basket_id_1      apple     5.0
2  basket_id_1       pear     NaN
4  basket_id_1     cherry     6.0
1  basket_id_2      apple     NaN
3  basket_id_2       pear     1.0
5  basket_id_2     cherry     NaN
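
To make the stripping problem concrete, here is a small sketch on the question's own column names. The variable names are mine, the anchored pattern ^fruit_ is a slightly stricter variant of the replace call above, and str.removeprefix is an alternative that assumes pandas 1.4 or newer.

```python
import pandas as pd

cols = pd.Series(["fruit_apple", "fruit_pear", "fruit_cherry"])

# str.strip treats its argument as a set of characters to trim from BOTH ends,
# so the trailing 'r' of 'pear' is removed as well
stripped = cols.str.strip("fruit_")

# An anchored regex removes only a leading 'fruit_'
replaced = cols.str.replace(r"^fruit_", "", regex=True)

# Assumption: pandas >= 1.4, where Series.str.removeprefix is available
removed = cols.str.removeprefix("fruit_")

print(stripped.tolist())   # ['apple', 'pea', 'cherry']
print(replaced.tolist())   # ['apple', 'pear', 'cherry']
print(removed.tolist())    # ['apple', 'pear', 'cherry']
```

Here only pear is damaged by strip, which is exactly why the bug is easy to miss on a small sample.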
