Transforming a grouped pandas DataFrame (multiple but not all columns) from long to wide
The problem:
I have a dataset with yearly data for different companies. The data is stored in long format: each year is a row, so company IDs are duplicated.
The data looks like this (the original DataFrame has many more columns).
I need to transform the long format into wide format, so that each company appears in exactly one row (no duplication).
This is what I would like the result to look like:
As you can see, I need the following:
some columns (like "year") are no longer needed
some columns (like "sales", "sales_change_in_2_years", "sales_change_over_year") should be transformed from long format to wide format, keeping their names with a number appended
some columns (like "ind1" and "ind2") should remain as they are (no transformation from long to wide)
So far I have only been able to work out a solution that handles one column at a time, so it does not really solve my problem.
This is my code:
test.groupby("comp_id")['sales_change_1'].apply(list).apply(pd.Series).rename(columns=lambda x: 'sales_{}'.format(x+1))
Is there a better solution to my problem?
After you drop the year column:
del test['Year']
You can then group the rows together by adding an extra column that holds a per-company row number, i.e. the position of each row among the rows belonging to the same company.
test['idx'] = test.groupby('Comp_id').cumcount() + 1
Then set it as part of the DataFrame index and use unstack() to turn it into columns.
test = test.set_index(['Comp_id', 'idx']).unstack()
At this point, your columns will be a MultiIndex with the created 'idx' as the second level, so you could already use the DataFrame as it stands, referring to columns as ('Sales', 1), ('Sales', 2), etc.
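To make the MultiIndex columns concrete, here is a minimal sketch on a toy DataFrame. The column names follow the snippets in this answer; the values (and the single "ind1" column) are made up for illustration and are not the asker's data.

```python
import pandas as pd

# Hypothetical toy data: names follow the answer, values are invented.
test = pd.DataFrame({
    "Comp_id": [1, 1, 2, 2],
    "Sales": [100, 110, 200, 190],
    "ind1": ["a", "a", "b", "b"],
})

# Number the rows within each company: 1, 2, ...
test["idx"] = test.groupby("Comp_id").cumcount() + 1

# Move the per-company counter from the index into the columns.
wide = test.set_index(["Comp_id", "idx"]).unstack()

# Every remaining column is now split per idx value and addressed by a tuple.
print(wide.columns.tolist())
first_sales = wide[("Sales", 1)]
print(first_sales)
```

Note that every non-index column gets unstacked this way, including "ind1", which is why it shows up once per idx value.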
If you want to flatten your columns, using an underscore as the separator, you can do so with:
test.columns = ['{}_{}'.format(col, idx) for (col, idx) in test.columns]
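The steps above unstack every column, so columns such as "ind1" and "ind2" would also be repeated per year. One possible way to leave them untouched, sketched below, is to put them in the index alongside the company ID before unstacking. This is an assumption layered on top of the answer, not part of it: it only works if those columns hold a single value per company, and the sample values are invented.

```python
import pandas as pd

# Hypothetical toy data: names follow the question and answer, values are invented.
test = pd.DataFrame({
    "Comp_id": [1, 1, 2, 2],
    "Year": [2019, 2020, 2019, 2020],
    "Sales": [100, 110, 200, 190],
    "ind1": ["a", "a", "b", "b"],
    "ind2": ["x", "x", "y", "y"],
})

# The year is no longer needed once rows are numbered per company.
test = test.drop(columns="Year")
test["idx"] = test.groupby("Comp_id").cumcount() + 1

# Assumption: ind1 and ind2 are constant within each company, so they can
# ride along in the index instead of being unstacked.
wide = test.set_index(["Comp_id", "ind1", "ind2", "idx"]).unstack()

# Flatten ('Sales', 1) -> 'Sales_1' and turn the index back into columns.
wide.columns = ["{}_{}".format(col, idx) for (col, idx) in wide.columns]
wide = wide.reset_index()
print(wide)
```

If "ind1" or "ind2" varied within a company, that company would end up on several rows with missing values, so it is worth checking that assumption on the real data first.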