简体   繁体   中英

Pandas: Group by distinct values of each cell and split column into multiple columns

I want to group-by every distinct cell and splitting the 'Status'-column into multiple columns based on their distinct values. The values of the new column(s) shall have the sum of 'Count' based on the occurrences.

My data:

        Department    Age   Salary  Status  Count
0            Sales  31-35  46K-50K  Senior     30
1            Sales  26-30  26K-30K  Junior     40
2            Sales  31-35  31K-35K  Junior     40
3               IT  21-25  46K-50K  Junior     20
4               IT  31-35  66K-70K  Senior      5
5               IT  26-30  46K-50K  Junior      3
6               IT  41-45  66K-70K  Senior      3
7        Marketing  36-40  46K-50K  Senior     10
8        Marketing  31-35  41K-45K  Junior      4
9   Administration  46-50  36K-40K  Senior      4
10  Administration  26-30  26K-30K  Junior      6

I want it to be:

        Attribute  Junior  Senior
0  Administration       6       4
1              IT      23       8
2       Marketing       4      10
3           Sales      80      30
4           21-25      20       0
5           26-30      49       0
6           31-35      44      35
7           36-40       0      10
8           41-45       0       3
9           46-50       0       4
10        26K-30K      46       0
11        31K-35K      40       0
12        36K-40K       0       4
13        41K-45K       4       0
14        46K-50K      23      40
15        66K-70K       0       8

Code:

import pandas as pd

df = pd.DataFrame({'Department': ['Sales', 'Sales', 'Sales',
                                  'IT', 'IT', 'IT', 'IT',
                                  'Marketing', 'Marketing',
                                 'Administration', 'Administration'],
                   'Age': ['31-35', '26-30', '31-35', '21-25', '31-35', 
                           '26-30', '41-45', '36-40', '31-35', '46-50',
                           '26-30'],
                   'Salary': ['46K-50K', '26K-30K', '31K-35K', '46K-50K',
                              '66K-70K', '46K-50K', '66K-70K', '46K-50K',
                              '41K-45K', '36K-40K', '26K-30K'],
                   'Status': ['Senior', 'Junior', 'Junior', 'Junior',
                              'Senior', 'Junior', 'Senior', 'Senior',
                              'Junior', 'Senior', 'Junior'],
                   'Count': [30, 40, 40, 20, 5, 3, 3, 10, 4, 4, 6]},
                    columns=['Department', 'Age', 'Salary', 'Status', 
                    'Count'])

df2 = df.groupby(['Department']).Count.sum() 
    .append(df.groupby(['Age']).Count.sum()) 
    .append(df.groupby(['Salary']).Count.sum())

This gives me:

Administration     10
IT                 31
Marketing          14
Sales             110
21-25              20
26-30              49
31-35              79
36-40              10
41-45               3
46-50               4
26K-30K            46
31K-35K            40
36K-40K             4
41K-45K             4
46K-50K            63
66K-70K             8

I can't figure out how to do the split on 'Status' and I also think that there's a misuse of append() in here.

You're almost there.

cols = ['Department', 'Age', 'Salary']
parts = [df.groupby([col, 'Status']).Count.sum() for col in cols]
df2 = pd.concat(parts).unstack(fill_value=0)

I used pd.concat() instead of repeated append() because as you pointed out, append() is not very good (it's slow).

Splitting on Status is easy: just add it to groupby() and then unstack() it at the end to turn it into column rather than row labels.

Use pd.pivot_table() -

One-Liner

df_pivot = pd.pivot_table(df, columns=['Status'], index=['Department'], aggfunc=np.sum).fillna(0).append(pd.pivot_table(df, columns=['Status'], index=['Salary'],aggfunc=np.sum).fillna(0))

Output

                Count
Status         Junior Senior
Administration    6.0    4.0
IT               23.0    8.0
Marketing         4.0   10.0
Sales            80.0   30.0
26K-30K          46.0    0.0
31K-35K          40.0    0.0
36K-40K           0.0    4.0
41K-45K           4.0    0.0
46K-50K          23.0   40.0
66K-70K           0.0    8.0

Explanation

Pandas pivot table is a convenience function to output pivot tables like excel. So your output is a combination of generating a pivot on the Department columns then the Salary columns.

So you can start with generating separate pivot tables with index and columns specified and then use append to concatenate them.

df1=pd.pivot_table(df, columns=['Status'], index=['Department'], aggfunc=np.sum).fillna(0)

Then the Salary part -

df2=pd.pivot_table(df, columns=['Status'], index=['Salary'],aggfunc=np.sum).fillna(0)

Finally -

df3 = df1.append(df2)

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM