I have a datafile with one VALUE column and multiple dummy variables representing TYPES. I have copied a short example below. I need the average of each type (which I can get) with a column with the named type (which I don't seem to be able to get). Pointers would be welcome.
import pandas as pd
data = {'salary' : [50000, 45000, 55000, 40000, 35000, 45000, 30000,25000,35000],
'manager': [1,1,1,0,0,0,0,0,0], 'foreman': [0,0,0,1,1,1,0,0,0],
'worker': [0,0,0,0,0,0,1,1,1]}
df = pd.DataFrame(data=data)
df
This is my input data.
salary manager foreman worker
0 50000 1 0 0
1 45000 1 0 0
2 55000 1 0 0
3 40000 0 1 0
4 35000 0 1 0
5 45000 0 1 0
6 30000 0 0 1
7 25000 0 0 1
8 35000 0 0 1
I can get the average, like this, but not consolidate the three dummy vars into one categorical column:
print(df.groupby(['manager','foreman','worker']).mean().reset_index())
manager foreman worker salary
0 0 0 1 30000
1 0 1 0 40000
2 1 0 0 50000
I would like to have something that looks like this:
need = {'salary' : [50000, 45000, 55000, 40000, 35000, 45000, 30000,25000,35000],
'type': ['manager','manager','manager','foreman','foreman','foreman','worker','worker','worker']}
df2 = pd.DataFrame(data=need)
df2
salary type
0 50000 manager
1 45000 manager
2 55000 manager
3 40000 foreman
4 35000 foreman
5 45000 foreman
6 30000 worker
7 25000 worker
8 35000 worker
I can do this simple example by hand. The result looks like this, which is ultimately where I will end up:
pay = {'type' : ['manager','foreman','worker'], 'avg_pay': [50000,40000,30000]}
df1 = pd.DataFrame(data=pay)
df1
type avg_pay
0 manager 50000
1 foreman 40000
2 worker 30000
Can't seem to find any documentation on how to "undummy" variables. How do I do this?
If you always have at most one 1 per row, you can use the pandas.from_dummies function, introduced in pandas 1.5:
df[['salary']].join(pd.from_dummies(df.drop(columns='salary'))[''].rename('type'))
# or
pd.DataFrame({'salary': df['salary'],
'type': pd.from_dummies(df.drop(columns='salary'))['']})
NB. If a row can contain only 0s, pass the default_category=np.nan parameter to from_dummies (with numpy imported as np).
Output:
salary type
0 50000 manager
1 45000 manager
2 55000 manager
3 40000 foreman
4 35000 foreman
5 45000 foreman
6 30000 worker
7 25000 worker
8 35000 worker
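Since the question ultimately asks for the average per type, here is a minimal sketch that chains from_dummies with a groupby. It assumes pandas 1.5 or later and the question's sample data. The second part shows default_category on a small made-up frame with an all-zero row; the label 'unassigned' is a hypothetical placeholder, not something from the question.

```python
import pandas as pd

df = pd.DataFrame({
    'salary': [50000, 45000, 55000, 40000, 35000, 45000, 30000, 25000, 35000],
    'manager': [1, 1, 1, 0, 0, 0, 0, 0, 0],
    'foreman': [0, 0, 0, 1, 1, 1, 0, 0, 0],
    'worker':  [0, 0, 0, 0, 0, 0, 1, 1, 1],
})

# Collapse the dummy columns into a single label column named 'type'.
types = pd.from_dummies(df.drop(columns='salary'))[''].rename('type')

# Average salary per type; sort=False keeps the order of first appearance.
avg = (df[['salary']].join(types)
       .groupby('type', sort=False)['salary'].mean()
       .rename('avg_pay')
       .reset_index())
print(avg)

# A row with no 1 at all needs default_category; without it from_dummies raises.
# 'unassigned' is a hypothetical placeholder label chosen for this sketch.
partial = pd.DataFrame({'manager': [1, 0], 'foreman': [0, 0]})
labels = pd.from_dummies(partial, default_category='unassigned')['']
print(labels.tolist())
```

The averages come back as floats, so the result should match the hand-built df1 from the question apart from the dtype.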
Solutions if there is always exactly one 1 per row:
Use DataFrame.melt with the ignore_index=False and var_name='type' parameters, then keep only the rows equal to 1 with DataFrame.loc, using DataFrame.pop to remove the value column:
df = (df.melt('salary', ignore_index=False, var_name='type')
.loc[lambda x: x.pop('value').eq(1)])
print (df)
salary type
0 50000 manager
1 45000 manager
2 55000 manager
3 40000 foreman
4 35000 foreman
5 45000 foreman
6 30000 worker
7 25000 worker
8 35000 worker
Or:
s = df.drop('salary', axis=1).stack()
df = df[['salary']].join(s[s.eq(1)].index.to_frame().droplevel(1)[1].rename('type'))
print (df)
salary type
0 50000 manager
1 45000 manager
2 55000 manager
3 40000 foreman
4 35000 foreman
5 45000 foreman
6 30000 worker
7 25000 worker
8 35000 worker
Solution if multiple 1s per row are possible: use DataFrame.dot with the column names, separating the matched names with a comma:
df1 = df.set_index('salary')
df = df1.eq(1).dot(df1.columns + ',').str[:-1].reset_index(name='type')
print (df)
salary type
0 50000 manager
1 45000 manager
2 55000 manager
3 40000 foreman
4 35000 foreman
5 45000 foreman
6 30000 worker
7 25000 worker
8 35000 worker
Test data with multiple 1s per row:
data = {'salary' : [50000, 45000, 55000, 40000, 35000, 45000, 30000,25000,35000],
'manager': [1,1,1,0,1,0,0,0,0], 'foreman': [0,0,0,1,1,1,0,0,0],
'worker': [1,0,0,1,0,0,1,1,1]}
df = pd.DataFrame(data=data)
print (df)
salary manager foreman worker
0 50000 1 0 1
1 45000 1 0 0
2 55000 1 0 0
3 40000 0 1 1
4 35000 1 1 0
5 45000 0 1 0
6 30000 0 0 1
7 25000 0 0 1
8 35000 0 0 1
df1 = df.set_index('salary')
df = df1.eq(1).dot(df1.columns + ',').str[:-1].reset_index(name='type')
print (df)
salary type
0 50000 manager,worker
1 45000 manager
2 55000 manager
3 40000 foreman,worker
4 35000 manager,foreman
5 45000 foreman
6 30000 worker
7 25000 worker
8 35000 worker
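To see why the dot trick works, it may help to compare it with an explicit row-wise join. In the matrix product, each boolean is multiplied by a column name (True keeps the name, False gives an empty string) and the pieces are concatenated, which is why the trailing comma is stripped with .str[:-1]. The sketch below uses a three-row subset of the test data above; the names via_dot and via_join are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'salary': [50000, 40000, 35000],
    'manager': [1, 0, 1],
    'foreman': [0, 1, 1],
    'worker':  [1, 1, 0],
})

df1 = df.set_index('salary')
flags = df1.eq(1)  # boolean mask: True where the dummy is set

# Matrix product of booleans with "name," strings, then drop the last comma.
via_dot = flags.dot(df1.columns + ',').str[:-1]

# Equivalent explicit version: join the names of the True columns per row.
via_join = flags.apply(lambda row: ','.join(df1.columns[row.to_numpy()]), axis=1)

print(via_dot.tolist())
print(via_join.tolist())
```

The explicit version is slower on large frames but easier to adapt, for example to a different separator.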
We can use pd.DataFrame.pipe for this purpose. With special thanks to jezrael for the inspiration, the code can be made more concise:
(df.melt(id_vars='salary', var_name='type')
.pipe(lambda d: d.loc[d['value'].eq(1)].drop('value', axis=1)))
salary type
0 50000 manager
1 45000 manager
2 55000 manager
12 40000 foreman
13 35000 foreman
14 45000 foreman
24 30000 worker
25 25000 worker
26 35000 worker
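Note that the index in the output above (0, 1, 2, 12, 13, ...) comes from the melted frame, not from the original rows. If the original row labels matter, one option is to pass ignore_index=False to melt, as in the other answer. A minimal sketch on the question's data, assuming pandas 1.1 or later for that parameter:

```python
import pandas as pd

df = pd.DataFrame({
    'salary': [50000, 45000, 55000, 40000, 35000, 45000, 30000, 25000, 35000],
    'manager': [1, 1, 1, 0, 0, 0, 0, 0, 0],
    'foreman': [0, 0, 0, 1, 1, 1, 0, 0, 0],
    'worker':  [0, 0, 0, 0, 0, 0, 1, 1, 1],
})

# ignore_index=False keeps each original row label on its melted copies,
# so the filtered result is indexed like the input frame.
out = (df.melt(id_vars='salary', var_name='type', ignore_index=False)
         .pipe(lambda d: d.loc[d['value'].eq(1)].drop(columns='value')))
print(out)
```

Alternatively, calling reset_index(drop=True) on the original result would simply renumber the rows from 0.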