Transforming a grouped pandas DataFrame (multiple but not all columns) from long to wide

The problem:

I have a dataset with yearly data for different companies. The data is stored in long format: each year is a row, so company ids are duplicated. The data looks like this (the original dataframe has many more columns).

[image: sample of the long-format data, not preserved]

I need to transform the long format into a wide format, so that each company is shown in one row (no duplication).

This is what I would like the result to look like:

[image: desired wide-format result, not preserved]

As you can see, I would need:

  • some columns (like "year") are not needed any more

  • some columns (like "sales", "sales_change_in_2_years", "sales_change_over_year") should be transformed from long format to wide format, keeping their names with a number appended

  • some columns (like "ind1" and "ind2") should remain as they are (no transformation from long to wide)

So far I have only been able to work out a solution that handles a single column, so it does not really solve my problem.

This is my code:

test.groupby("comp_id")['sales_change_1'].apply(list).apply(pd.Series).rename(columns=lambda x: 'sales_{}'.format(x+1))

Is there a better solution to my problem?

After you drop the years:

del test['Year']

You can group the rows together by adding an extra column that holds a running "index" for each row belonging to the same company.

test['idx'] = test.groupby('Comp_id').cumcount() + 1
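To illustrate what this numbering step produces, here is a minimal sketch on made-up data. The values are hypothetical, and the column names simply follow the snippets in this answer.

```python
import pandas as pd

# Hypothetical sample data; only the column names come from the answer.
test = pd.DataFrame({
    "Comp_id": [1, 1, 1, 2, 2],
    "Sales": [100, 110, 120, 50, 55],
})

# cumcount() numbers the rows within each group starting at 0,
# so adding 1 yields 1, 2, 3, ... for each company.
test["idx"] = test.groupby("Comp_id").cumcount() + 1

print(test["idx"].tolist())  # [1, 2, 3, 1, 2]
```

Note that cumcount() follows the current row order, so if the rows are not already in chronological order within each company, the frame would need to be sorted by year before the year column is dropped.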

Then set it as part of the DataFrame index and use unstack() to turn it into columns.

test = test.set_index(['Comp_id', 'idx']).unstack()
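As a rough illustration of the reshaping, the sketch below runs the same two calls on hypothetical data with two value columns ("Sales_change" is an invented name used only for this example).

```python
import pandas as pd

# Hypothetical sample data with two value columns per company and year.
test = pd.DataFrame({
    "Comp_id": [1, 1, 2, 2],
    "Sales": [100, 110, 50, 55],
    "Sales_change": [0.0, 0.1, 0.0, 0.1],
})
test["idx"] = test.groupby("Comp_id").cumcount() + 1

# Moving 'idx' into the index and unstacking it pivots the innermost
# index level into the columns: one row per company remains.
wide = test.set_index(["Comp_id", "idx"]).unstack()

print(wide.shape)  # (2, 4)
print(wide.columns.tolist())
# [('Sales', 1), ('Sales', 2), ('Sales_change', 1), ('Sales_change', 2)]
```

If companies have different numbers of rows, the positions that do not exist for a company are filled with NaN.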

At this point, your columns will be a MultiIndex with the created 'idx' as the second level, so you could already use the DataFrame as it stands, referring to columns as ('Sales', 1), ('Sales', 2), etc.
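For example, selecting from the MultiIndex columns might look like the sketch below. The data is hypothetical and the frame is rebuilt here so that the snippet runs on its own.

```python
import pandas as pd

# Hypothetical data, reshaped with the same steps as above.
test = pd.DataFrame({
    "Comp_id": [1, 1, 2, 2],
    "Sales": [100, 110, 50, 55],
})
test["idx"] = test.groupby("Comp_id").cumcount() + 1
wide = test.set_index(["Comp_id", "idx"]).unstack()

# A full (name, idx) tuple selects a single column as a Series.
second_year_sales = wide[("Sales", 2)]
print(second_year_sales.tolist())  # [110, 55]

# The first level alone selects every numbered column under that name.
all_sales = wide["Sales"]
print(all_sales.columns.tolist())  # [1, 2]
```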

If you want to flatten your columns, using an underscore as the separator, you can do so with:

test.columns = ['{}_{}'.format(col, idx) for (col, idx) in test.columns]
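Putting the steps together, the following end-to-end sketch also shows one possible way to keep columns such as "ind1" unchanged, which the steps above do not cover: leave them in the index so that unstack() does not spread them. This is an assumption-laden sketch, not part of the original answer. It assumes "ind1" is constant within each company and contains no missing values, and all data values are made up.

```python
import pandas as pd

# Hypothetical long-format data: one row per company and year.
test = pd.DataFrame({
    "Comp_id": [1, 1, 2, 2],
    "Year": [2019, 2020, 2019, 2020],
    "ind1": ["a", "a", "b", "b"],
    "Sales": [100, 110, 50, 55],
})

# Drop the year and number the rows within each company.
test = test.drop(columns="Year")
test["idx"] = test.groupby("Comp_id").cumcount() + 1

# Columns kept in the index are not unstacked. This assumes 'ind1'
# has a single value per company; otherwise extra rows would appear.
wide = test.set_index(["Comp_id", "ind1", "idx"]).unstack()

# Flatten the (name, idx) column tuples into 'name_idx' strings.
wide.columns = ["{}_{}".format(col, idx) for col, idx in wide.columns]

# Turn the remaining index levels back into ordinary columns.
wide = wide.reset_index()

print(wide.columns.tolist())  # ['Comp_id', 'ind1', 'Sales_1', 'Sales_2']
```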
