
Pandas, merge multiple dummy variables into one column by name

I have a data file with one value column (salary) and several dummy variables that encode each row's type. A short example is below. I need the average salary for each type, which I can get, alongside a single column naming the type, which I can't seem to get. Pointers would be welcome.

import pandas as pd
data = {'salary' : [50000, 45000, 55000, 40000, 35000, 45000, 30000,25000,35000],
        'manager': [1,1,1,0,0,0,0,0,0], 'foreman': [0,0,0,1,1,1,0,0,0], 
        'worker': [0,0,0,0,0,0,1,1,1]}
df = pd.DataFrame(data=data)
df

This is my input data.

   salary  manager  foreman  worker
0   50000        1        0       0
1   45000        1        0       0
2   55000        1        0       0
3   40000        0        1       0
4   35000        0        1       0
5   45000        0        1       0
6   30000        0        0       1
7   25000        0        0       1
8   35000        0        0       1

I can get the averages like this, but I can't consolidate the three dummy variables into one categorical column:

print(df.groupby(['manager','foreman','worker']).mean().reset_index())

   manager  foreman  worker  salary
0        0        0       1   30000
1        0        1       0   40000
2        1        0       0   50000

I would like to have something that looks like this:

need = {'salary' : [50000, 45000, 55000, 40000, 35000, 45000, 30000,25000,35000],
        'type': ['manager','manager','manager','foreman','foreman','foreman','worker','worker','worker']}
df2 = pd.DataFrame(data=need)
df2

   salary     type
0   50000  manager
1   45000  manager
2   55000  manager
3   40000  foreman
4   35000  foreman
5   45000  foreman
6   30000   worker
7   25000   worker
8   35000   worker

I can do this simple example by hand. The result looks like this, which is ultimately where I will end up:

pay = {'type' : ['manager','foreman','worker'], 'avg_pay': [50000,40000,30000]}
df1 = pd.DataFrame(data=pay)
df1

      type  avg_pay
0  manager    50000
1  foreman    40000
2   worker    30000

Can't seem to find any documentation on how to "undummy" variables. How do I do this?

If each row has at most one 1, you can use the pandas.from_dummies function, introduced in pandas 1.5:

df[['salary']].join(pd.from_dummies(df.drop(columns='salary'))[''].rename('type'))

# or
pd.DataFrame({'salary': df['salary'],
              'type': pd.from_dummies(df.drop(columns='salary'))['']})

NB. If a row can contain only 0s, pass the default_category=np.nan parameter to from_dummies.

Output:

   salary     type
0   50000  manager
1   45000  manager
2   55000  manager
3   40000  foreman
4   35000  foreman
5   45000  foreman
6   30000   worker
7   25000   worker
8   35000   worker
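To illustrate the note about all-zero rows, here is a minimal sketch. The four-row frame df_demo and the placeholder label 'unknown' are assumptions made up for this example (the note above suggests np.nan instead); only from_dummies and its default_category parameter come from the answer.

```python
import pandas as pd

# Hypothetical data: the last row has no dummy set (all zeros).
df_demo = pd.DataFrame({
    'salary': [50000, 40000, 30000, 20000],
    'manager': [1, 0, 0, 0],
    'foreman': [0, 1, 0, 0],
    'worker': [0, 0, 1, 0],
})

dummies = df_demo.drop(columns='salary')

# Without default_category, from_dummies raises a ValueError on the all-zero row.
# 'unknown' is an arbitrary placeholder label chosen for this sketch.
types = pd.from_dummies(dummies, default_category='unknown')['']

result = df_demo[['salary']].assign(type=types)
print(result)
```

Rows with a single 1 get the matching column name, and the all-zero row gets the default label.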

Solutions if there is always exactly one 1 per row:

Use DataFrame.melt with the ignore_index=False and var_name='type' parameters, then keep only the rows where value is 1 with DataFrame.loc, using DataFrame.pop to remove the value column:

df = (df.melt('salary', ignore_index=False, var_name='type')
        .loc[lambda x: x.pop('value').eq(1)])
print (df)
   salary     type
0   50000  manager
1   45000  manager
2   55000  manager
3   40000  foreman
4   35000  foreman
5   45000  foreman
6   30000   worker
7   25000   worker
8   35000   worker

Or:

s = df.drop('salary', axis=1).stack()
df = df[['salary']].join(s[s.eq(1)].index.to_frame().droplevel(1)[1].rename('type'))
print (df)
   salary     type
0   50000  manager
1   45000  manager
2   55000  manager
3   40000  foreman
4   35000  foreman
5   45000  foreman
6   30000   worker
7   25000   worker
8   35000   worker
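Once the type column exists, the per-type average the question ultimately asks for is a single groupby away. A minimal sketch, assuming the original one-hot data from the question; the names long_df, avg and the intermediate flag column are introduced here for illustration only:

```python
import pandas as pd

data = {'salary': [50000, 45000, 55000, 40000, 35000, 45000, 30000, 25000, 35000],
        'manager': [1, 1, 1, 0, 0, 0, 0, 0, 0],
        'foreman': [0, 0, 0, 1, 1, 1, 0, 0, 0],
        'worker': [0, 0, 0, 0, 0, 0, 1, 1, 1]}
df = pd.DataFrame(data)

# Reshape to long form and keep only the rows where the dummy is set.
long_df = df.melt('salary', var_name='type', value_name='flag')
long_df = long_df.loc[long_df['flag'].eq(1), ['salary', 'type']]

# sort=False keeps the types in order of first appearance.
avg = (long_df.groupby('type', sort=False)['salary']
              .mean()
              .reset_index(name='avg_pay'))
print(avg)
```

Note that mean() returns floats, so avg_pay holds 50000.0 rather than 50000.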
    

Solution if a row can contain multiple 1s: use DataFrame.dot with the column names, joining the matching names with a comma:

df1 = df.set_index('salary')
df = df1.eq(1).dot(df1.columns + ',').str[:-1].reset_index(name='type')
print (df)
   salary     type
0   50000  manager
1   45000  manager
2   55000  manager
3   40000  foreman
4   35000  foreman
5   45000  foreman
6   30000   worker
7   25000   worker
8   35000   worker

Test data with multiple 1s per row:

data = {'salary' : [50000, 45000, 55000, 40000, 35000, 45000, 30000,25000,35000],
        'manager': [1,1,1,0,1,0,0,0,0], 'foreman': [0,0,0,1,1,1,0,0,0], 
        'worker': [1,0,0,1,0,0,1,1,1]}
df = pd.DataFrame(data=data)
print (df)
   salary  manager  foreman  worker
0   50000        1        0       1
1   45000        1        0       0
2   55000        1        0       0
3   40000        0        1       1
4   35000        1        1       0
5   45000        0        1       0
6   30000        0        0       1
7   25000        0        0       1
8   35000        0        0       1

df1 = df.set_index('salary')
df = df1.eq(1).dot(df1.columns + ',').str[:-1].reset_index(name='type')
print (df)
   salary             type
0   50000   manager,worker
1   45000          manager
2   55000          manager
3   40000   foreman,worker
4   35000  manager,foreman
5   45000          foreman
6   30000           worker
7   25000           worker
8   35000           worker
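Which of the solutions above applies depends on whether every row has exactly one 1. A quick way to check, sketched here on the multi-flag test data (the names flags_per_row, is_one_hot and multi_rows are made up for this example):

```python
import pandas as pd

data = {'salary': [50000, 45000, 55000, 40000, 35000, 45000, 30000, 25000, 35000],
        'manager': [1, 1, 1, 0, 1, 0, 0, 0, 0],
        'foreman': [0, 0, 0, 1, 1, 1, 0, 0, 0],
        'worker': [1, 0, 0, 1, 0, 0, 1, 1, 1]}
df = pd.DataFrame(data)

# Count how many dummy columns are set in each row.
flags_per_row = df.drop(columns='salary').sum(axis=1)

# True only if every row has exactly one 1.
is_one_hot = bool(flags_per_row.eq(1).all())

# Index labels of the rows that carry more than one type.
multi_rows = flags_per_row[flags_per_row.gt(1)].index.tolist()

print(is_one_hot)
print(multi_rows)
```

If is_one_hot is False, the one-1-per-row solutions are not a safe choice, and the dot-based approach is the one that keeps every type.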

We can also use pd.DataFrame.pipe for this. With thanks to jezrael for the inspiration, the code can be made more concise:

(df.melt(id_vars='salary', var_name='type')
 .pipe(lambda d: d.loc[d['value'].eq(1)].drop('value', axis=1)))

    salary     type
0    50000  manager
1    45000  manager
2    55000  manager
12   40000  foreman
13   35000  foreman
14   45000  foreman
24   30000   worker
25   25000   worker
26   35000   worker
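The output above keeps the row labels of the melted frame (0 to 2, 12 to 14, 24 to 26) because melt is called without ignore_index=False. If a clean 0 to 8 index is wanted, one option is to append reset_index(drop=True). A minimal sketch on the question's original data; the name tidy is an assumption for this example:

```python
import pandas as pd

data = {'salary': [50000, 45000, 55000, 40000, 35000, 45000, 30000, 25000, 35000],
        'manager': [1, 1, 1, 0, 0, 0, 0, 0, 0],
        'foreman': [0, 0, 0, 1, 1, 1, 0, 0, 0],
        'worker': [0, 0, 0, 0, 0, 0, 1, 1, 1]}
df = pd.DataFrame(data)

tidy = (df.melt(id_vars='salary', var_name='type')
          # Keep the rows where the dummy is set, then drop the helper column.
          .pipe(lambda d: d.loc[d['value'].eq(1)].drop(columns='value'))
          # Discard the melted row labels and renumber from 0.
          .reset_index(drop=True))
print(tidy)
```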


 