
Pandas creating a new variable based on two existing variables

I have the following code, which I think is highly inefficient. Is there a better way to do this type of common recoding in pandas?

df['F'] = 0
df['F'][(df['B'] >=3) & (df['C'] >=4.35)] = 1
df['F'][(df['B'] >=3) & (df['C'] < 4.35)] = 2
df['F'][(df['B'] < 3) & (df['C'] >=4.35)] = 3
df['F'][(df['B'] < 3) & (df['C'] < 4.35)] = 4

Use numpy.select, and store the boolean masks in variables so that each comparison is evaluated only once:

import numpy as np

m1 = df['B'] >= 3
m2 = df['C'] >= 4.35
m3 = df['C'] < 4.35
m4 = df['B'] < 3

df['F'] = np.select([m1 & m2, m1 & m3, m4 & m2, m4 & m3], [1,2,3,4], default=0)
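To see how the masks and the default interact, here is a minimal sketch on made-up data. The sample values are an assumption for illustration; only the column names B, C, F and the thresholds come from the question. The last row has a missing B, so none of the four conditions match and it falls through to default=0.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; only the column names and thresholds come from the question.
df = pd.DataFrame({"B": [3, 5, 1, 2, np.nan],
                   "C": [4.35, 1.0, 9.9, 0.5, 4.0]})

# Each comparison is evaluated once and reused below.
m1 = df["B"] >= 3
m2 = df["C"] >= 4.35
m3 = df["C"] < 4.35
m4 = df["B"] < 3

# np.select takes the choice for the first condition that is True in each row;
# rows matching no condition (here, the row with NaN in B) get the default.
df["F"] = np.select([m1 & m2, m1 & m3, m4 & m2, m4 & m3], [1, 2, 3, 4], default=0)

print(df)
```

Because any comparison with NaN is False, the row with the missing B value matches neither m1 nor m4 and keeps the default.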

In your specific case, you can make use of the fact that booleans are actually integers (False == 0, True == 1) and use simple arithmetic:

df['F'] = 1 + (df['C'] < 4.35) + 2 * (df['B'] < 3)

Note that this does not treat NaN values in your B and C columns specially. Any comparison with NaN is False, so those rows are assigned as if they were at or above your limit.
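The following sketch shows the arithmetic version and its NaN behaviour on made-up data (the sample values are assumptions, not from the question). The last step is one possible guard, assuming you want rows with missing inputs to get 0 as in the question's initial df['F'] = 0; it is not part of the original answer.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; the last row has NaN in both columns.
df = pd.DataFrame({"B": [3, 5, 1, 2, np.nan],
                   "C": [4.35, 1.0, 9.9, 0.5, np.nan]})

# Boolean Series behave as 0/1 in arithmetic, so the two comparisons
# contribute 0 or 1 and 0 or 2 respectively, giving codes 1 to 4.
df["F"] = 1 + (df["C"] < 4.35) + 2 * (df["B"] < 3)

# NaN < x is False, so the NaN row is coded as "at or above" both limits.
unguarded = df["F"].tolist()

# Assumed guard: overwrite rows with any missing input with 0.
has_nan = df[["B", "C"]].isna().any(axis=1)
df["F"] = df["F"].mask(has_nan, 0)

print(unguarded)
print(df["F"].tolist())
```

Whether 0, a missing value, or some other code is the right result for incomplete rows depends on your data, so adjust the guard accordingly.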
