Create a new column from another column in Python
I have a pandas DataFrame in Python; let's call it df.
In this DataFrame I create a new column based on an existing column as follows:
df.loc[:, 'new_col'] = df['col']
Then I do the following:
df[df['new_col']=='Above Average'] = 'Good'
However, I noticed that this operation also changes the values in df['col']. What should I do so that the values in df['col'] are not affected by operations I perform on df['new_col']?
Use DataFrame.loc with boolean indexing:
df.loc[df['new_col']=='Above Average', 'new_col'] = 'Good'
If no column is specified, every column in the rows that match the condition is set to Good.
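To make the difference concrete, here is a minimal sketch on a made-up three-row frame (the sample values and the names whole_row and one_col are assumptions for illustration, not the asker's data). Without a column label the boolean mask overwrites whole rows, which is why df['col'] appeared to change; with a column label only new_col is touched.

```python
import pandas as pd

# Hypothetical sample data, only for illustration
df = pd.DataFrame({'col': ['Below Average', 'Above Average', 'Average']})
df['new_col'] = df['col']

# No column label: every column of the matching rows is overwritten
whole_row = df.copy()
whole_row[whole_row['new_col'] == 'Above Average'] = 'Good'

# Column label given: only new_col changes in the matching rows
one_col = df.copy()
one_col.loc[one_col['new_col'] == 'Above Average', 'new_col'] = 'Good'

print(whole_row)
print(one_col)
```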
Both lines of code can also be combined into one with numpy.where or Series.mask:
df['new_col'] = np.where(df['col']=='Above Average', 'Good', df['col'])
df['new_col'] = df['col'].mask(df['col']=='Above Average', 'Good')
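As a rough check that the two one-liners agree and leave the source column alone, the sketch below runs both on a small made-up frame. The sample values and the names via_where and via_mask are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, only for illustration
df = pd.DataFrame({'col': ['Below Average', 'Above Average', 'Average']})
m = df['col'] == 'Above Average'

# numpy.where: take 'Good' where m is True, otherwise the value from col
via_where = pd.Series(np.where(m, 'Good', df['col']), index=df.index)

# Series.mask: replace values with 'Good' where m is True
via_mask = df['col'].mask(m, 'Good')

print(via_where.tolist())
print(via_mask.tolist())
print(df['col'].tolist())  # the source column is not modified
```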
EDIT: To change many values, use Series.replace or Series.map with a dictionary of the specified values:
d = {'Good':['Above average','effective'], 'Very Good':['Really effective']}
#swap key values in dict
#http://stackoverflow.com/a/31674731/2901002
d1 = {k: oldk for oldk, oldv in d.items() for k in oldv}
print (d1)
{'Above average': 'Good', 'effective': 'Good', 'Really effective': 'Very Good'}
df['new_col'] = df['col'].replace(d1)
#for large data, map is usually faster than replace
df['new_col'] = df['col'].map(d1).fillna(df['col'])
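The fillna step matters because map returns NaN for any value missing from the dictionary, whereas replace leaves such values untouched. A minimal sketch, assuming a made-up column that contains one unmapped value ('Poor'):

```python
import pandas as pd

# Hypothetical sample data; 'Poor' is deliberately absent from the mapping
df = pd.DataFrame({'col': ['Above average', 'effective', 'Really effective', 'Poor']})
d1 = {'Above average': 'Good', 'effective': 'Good', 'Really effective': 'Very Good'}

replaced = df['col'].replace(d1)        # unmapped values are kept as they are
mapped_raw = df['col'].map(d1)          # unmapped values become NaN
mapped = mapped_raw.fillna(df['col'])   # restore the unmapped originals

print(replaced.tolist())
print(mapped_raw.isna().tolist())
print(mapped.tolist())
```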
There is also the option of using the pandas where method:
df['new_col'] = df['col']
df['new_col'] = df['new_col'].where(df['new_col']!='Above Average', other='Good')
But to be clear, np.where is typically the fastest way to go:
m = df['col'] == 'Above Average'
df['new_column'] = np.where(m, 'Good', df['col'])
df['new_column'] is the new column. Where the mask m is True, 'Good' is assigned; otherwise the value from df['col'] is kept.
+----+---------------+
| | col |
|----+---------------|
| 0 | Nan |
| 1 | Above Average |
| 2 | 1.0 |
+----+---------------+
+----+---------------+--------------+
| | col | new_column |
|----+---------------+--------------|
| 0 | Nan | Nan |
| 1 | Above Average | Good |
| 2 | 1.0 | 1.0 |
+----+---------------+--------------+
Here are also some notes on masking when using df.loc:
m = df['col']=='Above Average'
print(m)
df.loc[m, 'new_column'] = 'Good'
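As you can see, the result is the same, provided new_column was first created as a copy of col. Note that the mask m only selects the rows to overwrite; it carries no information about which value to use where m is False, so those rows keep whatever new_column already holds (NaN if the column did not exist before).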
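The behaviour of rows where m is False can be checked with a small sketch. The sample values and the names no_copy and with_copy are assumptions for illustration.

```python
import pandas as pd

# Hypothetical sample data, only for illustration
df = pd.DataFrame({'col': ['Below Average', 'Above Average', 'Average']})
m = df['col'] == 'Above Average'

# new_column does not exist yet: rows where m is False are left as NaN
no_copy = df.copy()
no_copy.loc[m, 'new_column'] = 'Good'

# new_column copied from col first: rows where m is False keep the col value
with_copy = df.copy()
with_copy['new_column'] = with_copy['col']
with_copy.loc[m, 'new_column'] = 'Good'

print(no_copy)
print(with_copy)
```

Only the second variant matches the np.where result, because np.where is told explicitly which value to fall back to.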