How to add a new column based on a row condition in a pandas DataFrame?
I want to add a new column based on a row condition that depends on two different columns of the same DataFrame.
I have the DataFrame below:
import pandas as pd

df1_data = {'e_id': {0:'101',1:'',2:'103',3:'',4:'105',5:'',6:''},
            'r_id': {0:'',1:'502',2:'',3:'504',4:'',5:'506',6:''}}
df = pd.DataFrame(df1_data)
print(df)
I want to add a new column named "sym".
Condition -
I tried the code below:
df1_data = {'e_id': {0:'101',1:'',2:'103',3:'',4:'105',5:''},
            'r_id': {0:'',1:'502',2:'',3:'504',4:'',5:'506'}}
df = pd.DataFrame(df1_data)
print(df)
if df['e_id'].any():
    df['sym'] = df['e_id']
print(df)
if df['r_id'].any():
    df['sym'] = df['r_id']
print(df)
But it gives me the wrong output.
Expected output:
  e_id r_id  sym
0  101       101
1       502  502
2  103       103
3       504  504
4  105       105
5       506  506
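For readers wondering why the attempt above goes wrong, here is a minimal sketch of what it does. It assumes the six-row frame from the attempt, built from lists with dtype=object (my choice, so that .any() behaves the same on any pandas version). The names e_has_values and r_has_values are only for illustration.

```python
import pandas as pd

# Six-row frame from the attempt; dtype=object is an assumption made here
# so that Series.any() takes the same code path on old and new pandas.
df = pd.DataFrame({
    'e_id': ['101', '', '103', '', '105', ''],
    'r_id': ['', '502', '', '504', '', '506'],
}, dtype=object)

# .any() collapses the whole column to one truthy/falsy value, so each
# `if` is decided once for the entire column, not once per row.
e_has_values = bool(df['e_id'].any())
r_has_values = bool(df['r_id'].any())
print(e_has_values, r_has_values)

if e_has_values:
    df['sym'] = df['e_id']  # copies the whole e_id column
if r_has_values:
    df['sym'] = df['r_id']  # then overwrites it with the whole r_id column

print(df)
```

Both conditions are truthy, so sym ends up as a plain copy of r_id. A per-row choice needs a vectorised operation rather than a Python `if`.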
pandas
Using mask + fillna + assign
d1 = df.mask(df == '')
df.assign(sym=d1.e_id.fillna(d1.r_id)).dropna(subset=['sym'])
  e_id r_id  sym
0  101       101
1       502  502
2  103       103
3       504  504
4  105       105
5       506  506
How It Works
I mask the '' values, on the assumption that you meant those to be null.
With fillna, I take e_id if it is not null; otherwise I take r_id if it is not null.
dropna with subset=['sym'] drops a row only if the new column is null, and that happens only when both e_id and r_id were null.
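The three steps can be spelled out one at a time. This is only a sketch using the seven-row frame from the question, built from lists for brevity; the intermediate names sym and result are mine.

```python
import pandas as pd

df = pd.DataFrame({
    'e_id': ['101', '', '103', '', '105', '', ''],
    'r_id': ['', '502', '', '504', '', '506', ''],
})

# Step 1: turn empty strings into missing values.
d1 = df.mask(df == '')

# Step 2: take e_id where present, otherwise fall back to r_id.
sym = d1['e_id'].fillna(d1['r_id'])

# Step 3: attach the new column and drop rows where it is still missing
# (only row 6, where both e_id and r_id were empty).
result = df.assign(sym=sym).dropna(subset=['sym'])
print(result)
```

Note that assign works on the original df, so e_id and r_id keep their empty strings in the result; only sym is built from the masked copy.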
numpy
Using np.where + assign
e = df.e_id.values
r = df.r_id.values
df.assign(
    sym=np.where(
        e != '', e,
        np.where(r != '', r, np.nan)
    )
).dropna(subset=['sym'])
  e_id r_id  sym
0  101       101
1       502  502
2  103       103
3       504  504
4  105       105
5       506  506
numpy v2
Reconstruct the dataframe from values
v = df.values
m = (v != '').any(1)
v = v[m]
c1 = v[:, 0]
c2 = v[:, 1]
pd.DataFrame(
    np.column_stack([v, np.where(c1 != '', c1, c2)]),
    df.index[m], df.columns.tolist() + ['sym']
)
  e_id r_id  sym
0  101       101
1       502  502
2  103       103
3       504  504
4  105       105
5       506  506
Timing
%%timeit
e = df.e_id.values
r = df.r_id.values
df.assign(sym=np.where(e != '', e, np.where(r != '', r, np.nan))).dropna(subset=['sym'])
1000 loops, best of 3: 1.23 ms per loop
%%timeit
d1 = df.mask(df == '')
df.assign(sym=d1.e_id.fillna(d1.r_id)).dropna(subset=['sym'])
100 loops, best of 3: 2.44 ms per loop
%%timeit
v = df.values
m = (v != '').any(1)
v = v[m]
c1 = v[:, 0]
c2 = v[:, 1]
pd.DataFrame(
    np.column_stack([v, np.where(c1 != '', c1, c2)]),
    df.index[m], df.columns.tolist() + ['sym']
)
1000 loops, best of 3: 204 µs per loop
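The %%timeit cells above only run inside IPython. If you want to repeat the comparison as a plain script, something like the following sketch could work. The function names and the run count are my own choices, and dtype=object is an assumption so that .values returns plain NumPy object arrays. The numbers it prints will depend on your machine and library versions.

```python
import timeit

import numpy as np
import pandas as pd

# Same seven rows as the question; dtype=object is an assumption made here
# so that .values returns plain NumPy object arrays on any pandas version.
df = pd.DataFrame({
    'e_id': ['101', '', '103', '', '105', '', ''],
    'r_id': ['', '502', '', '504', '', '506', ''],
}, dtype=object)


def mask_fillna(df):
    d1 = df.mask(df == '')
    return df.assign(sym=d1.e_id.fillna(d1.r_id)).dropna(subset=['sym'])


def nested_where(df):
    e = df.e_id.values
    r = df.r_id.values
    sym = np.where(e != '', e, np.where(r != '', r, np.nan))
    return df.assign(sym=sym).dropna(subset=['sym'])


def from_values(df):
    v = df.values
    m = (v != '').any(axis=1)
    v = v[m]
    c1, c2 = v[:, 0], v[:, 1]
    return pd.DataFrame(
        np.column_stack([v, np.where(c1 != '', c1, c2)]),
        df.index[m], df.columns.tolist() + ['sym']
    )


runs = 200  # arbitrary; raise it for steadier numbers
for func in (mask_fillna, nested_where, from_values):
    seconds = timeit.timeit(lambda: func(df), number=runs)
    print(f"{func.__name__}: {seconds / runs * 1e6:.0f} µs per call")
```

On a seven-row frame the timings mostly reflect fixed pandas overhead, so it is worth rerunning on data of realistic size before choosing an approach.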
First, filter out the rows where both columns are empty, using boolean indexing with any:
df = df[(df != '').any(axis=1)]
# alternatively
# df = df[(df['e_id'] != '') | (df['r_id'] != '')]
Then use mask with combine_first:
df['sym'] = df['e_id'].mask(df['e_id'] == '').combine_first(df['r_id'])
print (df)
  e_id r_id  sym
0  101       101
1       502  502
2  103       103
3       504  504
4  105       105
5       506  506
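Here is a self-contained sketch of the filter-then-combine_first approach, again on the seven-row frame from the question. The .copy() call is my addition: in pandas versions without Copy-on-Write, assigning a column to a filtered frame can emit SettingWithCopyWarning, and an explicit copy avoids that ambiguity.

```python
import pandas as pd

df = pd.DataFrame({
    'e_id': ['101', '', '103', '', '105', '', ''],
    'r_id': ['', '502', '', '504', '', '506', ''],
})

# Keep rows where at least one column is non-empty; .copy() makes the
# result an independent frame before a new column is assigned to it.
df = df[(df != '').any(axis=1)].copy()

# mask() replaces empty e_id values with missing values;
# combine_first() then fills those gaps from r_id on the same row.
df['sym'] = df['e_id'].mask(df['e_id'] == '').combine_first(df['r_id'])
print(df)
```

Because the all-empty row is removed first, no dropna step is needed afterwards.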
NumPy solution with filtering and numpy.where:
df = df[(df['e_id'] != '') | (df['r_id'] != '')]
e_id = df.e_id.values
r_id = df.r_id.values
df['sym'] = np.where(e_id != '', e_id, r_id)
print (df)
  e_id r_id  sym
0  101       101
1       502  502
2  103       103
3       504  504
4  105       105
5       506  506
You can start with column 'e_id' and replace its values with the 'r_id' values wherever 'e_id' is "empty", using pandas.DataFrame.mask and its 'other' parameter:
df['sym'] = df['e_id'].mask(df['e_id'] == '', other=df['r_id'], axis=0)
Then you just need to remove the rows where sym is "empty":
df = df[df.sym!='']
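Put together on the seven-row frame from the question, the two steps look roughly like this. It is a sketch, with the frame built from lists for brevity and the intermediate sym_before_filter kept only to show what the filter removes.

```python
import pandas as pd

df = pd.DataFrame({
    'e_id': ['101', '', '103', '', '105', '', ''],
    'r_id': ['', '502', '', '504', '', '506', ''],
})

# Where e_id is empty, take the value from r_id on the same row.
df['sym'] = df['e_id'].mask(df['e_id'] == '', other=df['r_id'])
sym_before_filter = df['sym'].tolist()

# Row 6 had both columns empty, so its sym is still '' and gets filtered out.
df = df[df['sym'] != '']
print(df)
```

No missing values are ever created here, which is why the cleanup is a string comparison rather than dropna.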