How to create a new column in pandas based off boolean values from other columns?

Using pandas, I have a DataFrame with a column that holds each person's job, and I want to add another column that is 1 or 0 depending on whether the person is a manager. The actual data is far longer, so is there a way to use boolean logic instead of entering the 1s and 0s manually? The desired output is shown below.

import numpy as np
import pandas as pd

job = pd.Series({'David': 'manager', 'Keith': 'player', 'Bob': 'coach', 'Rick': 'manger'})
is_manager = pd.Series({'David': '1', 'Keith': '0', 'Bob': '0', 'Rick': '1'})
data = pd.DataFrame({'job': job, 'is_manager': is_manager})
print(data)

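Compare the column with Series.eq to get a boolean mask, then convert the mask to 0/1 by casting to integers with astype, with Series.view, or with numpy.where. Note that Series.view has been deprecated in newer pandas releases, so astype or numpy.where is the safer choice: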

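data = pd.DataFrame({'job': job})
# Each line below produces the same 0/1 column; any one of them is enough.
data['is_manager'] = data['job'].eq('manager').astype(int)       # cast the boolean mask to integers
data['is_manager'] = data['job'].eq('manager').view('i1')        # reinterpret the booleans as 1-byte ints
data['is_manager'] = np.where(data['job'].eq('manager'), 1, 0)   # pick 1 where True, else 0
print(data)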
           job  is_manager
David  manager           1
Keith   player           0
Bob      coach           0
Rick    manger           0
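
As a quick sanity check, the sketch below rebuilds the same frame and confirms that the astype and numpy.where routes agree. The names mask, via_astype and via_where are only illustrative. It also shows why Rick gets a 0 in the output above: the comparison is an exact string match, and 'manger' is not 'manager'.

```python
import numpy as np
import pandas as pd

job = pd.Series({'David': 'manager', 'Keith': 'player', 'Bob': 'coach', 'Rick': 'manger'})
data = pd.DataFrame({'job': job})

# Boolean Series: True only where the job is exactly 'manager'
mask = data['job'].eq('manager')

# Two ways to turn True/False into 1/0
via_astype = mask.astype(int)
via_where = pd.Series(np.where(mask, 1, 0), index=data.index)

# Both routes give the same values; 'manger' does not match 'manager'
assert via_astype.tolist() == via_where.tolist() == [1, 0, 0, 0]

data['is_manager'] = via_astype
print(data)
```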

Performance:

# 40k rows
data = pd.concat([data] * 10000, ignore_index=True)
print(data)

In [234]: %timeit data['is_manager'] = data['job'].eq('manager').astype(int)
2.93 ms ± 218 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [235]: %timeit data['is_manager'] = data['job'].eq('manager').view('i1')
2.96 ms ± 161 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [236]: %timeit data['is_manager'] = np.where(data['job'].eq('manager'), 1, 0)
2.89 ms ± 192 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [237]: %timeit data['is_manager'] = data.apply(lambda row: 1 if row['job'] == 'manager' else 0, axis=1)
340 ms ± 8.85 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
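
The %timeit lines above only run inside IPython or Jupyter. A rough way to repeat the comparison in a plain script, assuming the standard-library timeit module is acceptable for the purpose, might look like the sketch below. The helper names vectorized and row_wise are made up for this example, and the numbers will vary by machine and pandas version.

```python
import timeit

import pandas as pd

job = pd.Series({'David': 'manager', 'Keith': 'player', 'Bob': 'coach', 'Rick': 'manger'})
# Repeat the 4-row frame 10000 times to get 40k rows, as in the benchmark above
data = pd.concat([pd.DataFrame({'job': job})] * 10000, ignore_index=True)

def vectorized(df):
    # Column-wise comparison followed by an integer cast
    return df['job'].eq('manager').astype(int)

def row_wise(df):
    # Python-level function called once per row
    return df.apply(lambda row: 1 if row['job'] == 'manager' else 0, axis=1)

# Both produce the same values, so the timing compares like with like
assert vectorized(data).tolist() == row_wise(data).tolist()

runs = 3
t_vec = timeit.timeit(lambda: vectorized(data), number=runs)
t_row = timeit.timeit(lambda: row_wise(data), number=runs)
print(f"vectorized:     {t_vec / runs:.4f} s per call")
print(f"row-wise apply: {t_row / runs:.4f} s per call")
```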

This is a possible solution:

import numpy as np
import pandas as pd

job=pd.Series({'David':'manager','Keith':'player', 'Bob':'coach', 'Rick':'manger'})
data = pd.DataFrame({'job':job})

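# Element-wise comparison gives a boolean DataFrame with the same column name
is_manager = data == "manager"
# Rename the column so it does not clash with 'job' when joined
is_manager = is_manager.rename(columns={"job": "is_manager"})
data = data.join(is_manager)
# Convert True/False to 1/0 row by row
data['is_manager'] = data.apply(lambda row: 1 if row['is_manager'] else 0, axis=1)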
print(data)
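
To make the intermediate step in this answer easier to follow, here is a small sketch of what the DataFrame comparison returns. The variable name flags is only for illustration. Since only one column is being tested, the rename and join can probably be skipped by assigning the boolean column directly, which is a simplification of this answer rather than what it does.

```python
import pandas as pd

job = pd.Series({'David': 'manager', 'Keith': 'player', 'Bob': 'coach', 'Rick': 'manger'})
data = pd.DataFrame({'job': job})

# Comparing a whole DataFrame to a scalar is element-wise and keeps the labels
flags = data == "manager"
assert list(flags.columns) == ['job']
assert flags['job'].tolist() == [True, False, False, False]

# With a single column to test, the boolean column can be cast and assigned directly
data['is_manager'] = flags['job'].astype(int)
print(data)
```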

I don't know how efficient this is, but it works.

import numpy as np
import pandas as pd

job = pd.Series({'David': 'manager', 'Keith': 'player', 'Bob': 'coach', 'Rick': 'manager'})

data = pd.DataFrame({'job': job})

# Row-wise apply: 1 if the row's job is 'manager', else 0
data['is_manager'] = data.apply(lambda row: 1 if row['job'] == 'manager' else 0, axis=1)

print(data)
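
If a Python-level function is preferred over the vectorized comparison, one lighter variant is to map over the single column instead of applying across whole rows. This is a sketch, not part of the answer above, and manager_flag is a hypothetical helper name.

```python
import pandas as pd

job = pd.Series({'David': 'manager', 'Keith': 'player', 'Bob': 'coach', 'Rick': 'manager'})
data = pd.DataFrame({'job': job})

def manager_flag(title):
    """Return 1 if the job title is exactly 'manager', else 0 (hypothetical helper)."""
    return 1 if title == 'manager' else 0

# Series.map calls the function once per value in the 'job' column
data['is_manager'] = data['job'].map(manager_flag)
print(data)
```

Mapping over one column avoids building a row object for every record, although it still calls a Python function once per value.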
