Create a new column from another column in Python
I have a pandas DataFrame in Python; let's call it df.
In this DataFrame, I create a new column from an existing column, like this:
df.loc[:, 'new_col'] = df['col']
Then I do the following:
df[df['new_col']=='Above Average'] = 'Good'
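However, I noticed that this operation also changes df['col'].
What should I do so that the values in df['col'] are not affected by the operations I perform on df['new_col']?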
Use DataFrame.loc with boolean indexing:
df.loc[df['new_col']=='Above Average', 'new_col'] = 'Good'
If no column is specified, every column in the rows that match the condition is set to 'Good'.
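To make the difference concrete, here is a minimal sketch on a made-up three-row frame (the sample values are assumptions, not data from the question). It contrasts assigning through the boolean mask alone with assigning through .loc and a column label.

```python
import pandas as pd

# Hypothetical sample data, for illustration only.
df = pd.DataFrame({'col': ['Below Average', 'Above Average', 'Average']})
df['new_col'] = df['col']

# Mask only: every column in the matching rows is overwritten.
overwritten = df.copy()
overwritten[overwritten['new_col'] == 'Above Average'] = 'Good'

# Mask plus column label: only new_col is changed.
targeted = df.copy()
targeted.loc[targeted['new_col'] == 'Above Average', 'new_col'] = 'Good'

print(overwritten)
print(targeted)
```

In the first result both col and new_col read 'Good' in the matching row; in the second, col still holds 'Above Average'.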
The two lines of code can also be combined into one with numpy.where or Series.mask:
import numpy as np
# the condition reads from df['col'], so new_col does not need to exist beforehand
df['new_col'] = np.where(df['col']=='Above Average', 'Good', df['col'])
df['new_col'] = df['col'].mask(df['col']=='Above Average', 'Good')
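As a quick check that the two one-liners agree, the following sketch runs both on assumed sample data. The column names via_where and via_mask are hypothetical and exist only so the results can sit side by side.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, for illustration only.
df = pd.DataFrame({'col': ['Below Average', 'Above Average', 'Average']})
is_above = df['col'] == 'Above Average'

# np.where(condition, value_if_true, value_if_false) returns a NumPy array.
df['via_where'] = np.where(is_above, 'Good', df['col'])

# Series.mask replaces values where the condition is True and returns a new Series.
df['via_mask'] = df['col'].mask(is_above, 'Good')

print(df)
```

Neither call writes to col, because each one builds a new object that is then assigned to a separate column.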
Edit: To change many values at once, use Series.replace or Series.map with a dictionary that specifies the values:
d = {'Good':['Above average','effective'], 'Very Good':['Really effective']}
#swap key values in dict
#http://stackoverflow.com/a/31674731/2901002
d1 = {k: oldk for oldk, oldv in d.items() for k in oldv}
print (d1)
{'Above average': 'Good', 'effective': 'Good', 'Really effective': 'Very Good'}
df['new_col'] = df['col'].replace(d1)
#on large data map is usually faster; fillna restores the values that are not keys of d1
df['new_col'] = df['col'].map(d1).fillna(df['col'])
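The reason for the fillna call becomes visible on a value that is not in the dictionary. This is a small sketch with an assumed extra value 'Poor' that d1 does not cover.

```python
import pandas as pd

# Hypothetical sample data; 'Poor' is deliberately missing from d1.
df = pd.DataFrame({'col': ['Above average', 'effective', 'Really effective', 'Poor']})
d1 = {'Above average': 'Good', 'effective': 'Good', 'Really effective': 'Very Good'}

# replace leaves values that are not keys of d1 untouched.
replaced = df['col'].replace(d1)

# map returns NaN for values that are not keys of d1 ...
mapped_only = df['col'].map(d1)

# ... so fillna puts the original value back in those positions.
mapped = mapped_only.fillna(df['col'])

print(replaced.tolist())
print(mapped_only.isna().tolist())
print(mapped.tolist())
```

Both replaced and mapped keep 'Poor' as it was, while mapped_only has a missing value in that position.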
Another option is the where method:
df['new_col'] = df['col']
# assign the result back; inplace=True on a selected column may act on a copy
df['new_col'] = df['new_col'].where(df['new_col']!='Above Average', other='Good')
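Note that where works the opposite way round from mask: it keeps values where the condition is True and substitutes other elsewhere. A minimal sketch on assumed sample data, with keep as a hypothetical helper name:

```python
import pandas as pd

# Hypothetical sample data, for illustration only.
df = pd.DataFrame({'col': ['Below Average', 'Above Average', 'Average']})

# Rows to keep unchanged; everything else becomes 'Good'.
keep = df['col'] != 'Above Average'
df['new_col'] = df['col'].where(keep, other='Good')

# The same result expressed with mask and the inverted condition.
same_as_mask = df['col'].mask(~keep, 'Good')

print(df)
```

Assigning the returned Series to the column avoids relying on inplace=True.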
To be clear, though, np.where is the fastest way to go:
m = df['col'] == 'Above Average'
df['new_column'] = np.where(m, 'Good', df['col'])
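df['new_column'] is the new column. Where the mask m is True, the new column is assigned 'Good'; elsewhere it takes the value from df['col'], which itself is left unchanged.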
+----+---------------+
| | col |
|----+---------------|
| 0 | Nan |
| 1 | Above Average |
| 2 | 1.0 |
+----+---------------+
+----+---------------+--------------+
| | col | new_column |
|----+---------------+--------------|
| 0 | Nan | Nan |
| 1 | Above Average | Good |
| 2 | 1.0 | 1.0 |
+----+---------------+--------------+
Here is also a note on the mask when using df.loc:
m = df['col']=='Above Average'
print(m)
df.loc[m, 'new_column'] = 'Good'
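As you can see, the result is the same, but note that the mask m only selects the rows to write: where m is False, .loc assigns nothing, so those rows keep whatever new_column already contained.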
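One caveat is worth checking here, sketched on assumed sample data: because .loc writes only the rows where m is True, the outcome depends on whether new_column already exists. The frames sparse and filled are hypothetical names used to show both cases.

```python
import pandas as pd

# Hypothetical sample data, for illustration only.
df = pd.DataFrame({'col': ['Below Average', 'Above Average', 'Average']})
m = df['col'] == 'Above Average'

# new_column does not exist yet: rows where m is False are left as NaN.
sparse = df.copy()
sparse.loc[m, 'new_column'] = 'Good'

# Copy col first, then overwrite only the rows where m is True.
filled = df.copy()
filled['new_column'] = filled['col']
filled.loc[m, 'new_column'] = 'Good'

print(sparse)
print(filled)
```

Only the second variant matches what np.where produces, since np.where is told explicitly which values to use where m is False.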