I have a pandas DataFrame in Python; let's call it df.
In this DataFrame I create a new column based on an existing column as follows:
df.loc[:, 'new_col'] = df['col']
Then I do the following:
df[df['new_col']=='Above Average'] = 'Good'
However, I noticed that this operation also changes the values in df['col']
What should I do so that the values in df['col'] are not affected by operations I perform on df['new_col']?
Use DataFrame.loc with boolean indexing:
df.loc[df['new_col']=='Above Average', 'new_col'] = 'Good'
If no column is specified, every column in the rows matching the condition is set to Good.
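To make the difference concrete, here is a minimal sketch on a made-up three-row frame. The sample values are assumptions for illustration, not data from the question.

```python
import pandas as pd

# Hypothetical sample data, for illustration only.
df = pd.DataFrame({'col': ['Below Average', 'Above Average', 'Average']})
df['new_col'] = df['col']

# No column given: every column of the matching rows is overwritten.
whole_row = df.copy()
whole_row[whole_row['new_col'] == 'Above Average'] = 'Good'

# Column given: only 'new_col' changes in the matching rows.
one_col = df.copy()
one_col.loc[one_col['new_col'] == 'Above Average', 'new_col'] = 'Good'

print(whole_row)
print(one_col)
```

In the first result both col and new_col read Good in the matching row; in the second, col is left alone.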
Alternatively, both lines of code can be replaced by a single call to numpy.where or Series.mask:
import numpy as np
df['new_col'] = np.where(df['col']=='Above Average', 'Good', df['col'])
df['new_col'] = df['col'].mask(df['col']=='Above Average', 'Good')
EDIT: To change many values, use Series.replace or Series.map with a dictionary of the specified values:
d = {'Good':['Above average','effective'], 'Very Good':['Really effective']}
#swap key values in dict
#http://stackoverflow.com/a/31674731/2901002
d1 = {k: oldk for oldk, oldv in d.items() for k in oldv}
print (d1)
{'Above average': 'Good', 'effective': 'Good', 'Really effective': 'Very Good'}
df['new_col'] = df['col'].replace(d1)
#for large data, map with fillna usually performs better
df['new_col'] = df['col'].map(d1).fillna(df['col'])
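As a quick check that the two approaches agree, here is a small self-contained sketch. The sample column is an assumption; the value 'Poor' is included on purpose because it is missing from the dictionary, which is the case where fillna matters.

```python
import pandas as pd

# Hypothetical sample data; 'Poor' is deliberately absent from the mapping.
df = pd.DataFrame({'col': ['Above average', 'effective', 'Really effective', 'Poor']})

d = {'Good': ['Above average', 'effective'], 'Very Good': ['Really effective']}
# Invert the dictionary: each old value points to its new label.
d1 = {old: new for new, olds in d.items() for old in olds}

# replace leaves unmapped values untouched.
replaced = df['col'].replace(d1)

# map turns unmapped values into NaN, so fill them back from the original column.
mapped = df['col'].map(d1).fillna(df['col'])

print(replaced.tolist())
print(mapped.tolist())
```

Without the fillna step, map would leave NaN for every value that has no key in d1.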
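There is also the option of using the Series.where method. It keeps values where the condition is True and substitutes other elsewhere. Assign the result back instead of calling it with inplace=True on a selected column, which may not update df:
df['new_col'] = df['col'].where(df['col']!='Above Average', other='Good')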
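Series.where and Series.mask are mirror images of each other, which the following sketch illustrates on an assumed two-row column (the data and the variable names cond, via_mask and via_where are made up for this example).

```python
import pandas as pd

# Hypothetical sample data, for illustration only.
df = pd.DataFrame({'col': ['Average', 'Above Average']})
cond = df['col'] == 'Above Average'

# mask replaces values where the condition is True.
via_mask = df['col'].mask(cond, 'Good')

# where keeps values where the condition is True, so the condition is inverted.
via_where = df['col'].where(~cond, 'Good')

print(via_mask.tolist())
print(via_where.tolist())
```

Both calls return a new Series and leave df['col'] unchanged.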
That said, np.where is usually the fastest option:
m = df['col'] == 'Above Average'
df['new_column'] = np.where(m, 'Good', df['col'])
df['new_column'] is the new column. Where the mask m is True, 'Good' is assigned; otherwise the value from df['col'] is used.
+----+---------------+
|    | col           |
|----+---------------|
|  0 | Nan           |
|  1 | Above Average |
|  2 | 1.0           |
+----+---------------+
+----+---------------+--------------+
|    | col           | new_column   |
|----+---------------+--------------|
|  0 | Nan           | Nan          |
|  1 | Above Average | Good         |
|  2 | 1.0           | 1.0          |
+----+---------------+--------------+
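The tables above can be reproduced with a short script. This is a sketch that assumes the first row holds a real missing value (np.nan) and the last row the float 1.0; the original data types are not stated.

```python
import numpy as np
import pandas as pd

# Assumed sample values mirroring the table; mixed types give an object column.
df = pd.DataFrame({'col': [np.nan, 'Above Average', 1.0]})

# Boolean mask: True only where col equals 'Above Average'.
m = df['col'] == 'Above Average'

# Where m is True use 'Good', otherwise take the value from col.
df['new_column'] = np.where(m, 'Good', df['col'])

print(df)
```

df['col'] is only read here, so it stays unchanged.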
Here are also some notes on masking with df.loc:
m = df['col']=='Above Average'
print(m)
df.loc[m, 'new_column'] = 'Good'
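As you can see, the result is the same, but note the difference. With df.loc, the mask m only selects the rows to overwrite, and rows where m is False keep whatever new_column already holds. With np.where, the third argument states explicitly where to read the value from when m is False.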