I have a dataframe in Python such as:
seqnames start end name number strand
A 50 453 A 1 -
B 30 322 A 2 -
C 10 432 A 3 -
D 36 344 A 4 +
E 40 321 A 5 +
F 78 234 A 6 -
and I would like to change the values in the start and end columns depending on the symbol in the strand column.
So for each row, if the strand is -, then do start+1 and end-2; if the strand is +, then do nothing.
Here I should get:
seqnames start end name number strand
A 51 451 A 1 -
B 31 320 A 2 -
C 11 430 A 3 -
D 36 344 A 4 +
E 40 321 A 5 +
F 79 232 A 6 -
Thank you for your help
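For anyone who wants to run the answers below, here is a minimal sketch that rebuilds the example table. The dtypes are an assumption (integers for start, end and number, strings for the rest), and the expected_start / expected_end lists are just the desired values from the second table copied out by hand.

```python
import pandas as pd

# Example table from the question; dtypes are assumed, not stated in the post.
df = pd.DataFrame({
    "seqnames": list("ABCDEF"),
    "start": [50, 30, 10, 36, 40, 78],
    "end": [453, 322, 432, 344, 321, 234],
    "name": ["A"] * 6,
    "number": [1, 2, 3, 4, 5, 6],
    "strand": ["-", "-", "-", "+", "+", "-"],
})

# Desired values, copied from the expected table (names are illustrative).
expected_start = [51, 31, 11, 36, 40, 79]
expected_end = [451, 320, 430, 344, 321, 232]
```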
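Use numpy.where (the mask is converted to a NumPy array first, because recent pandas versions no longer support [:, None] indexing directly on a Series):
import numpy as np
df[['start','end']] = np.where(df['strand'].eq('-').to_numpy()[:, None],
                               np.column_stack((df['start']+1, df['end']-2)),
                               df[['start','end']].to_numpy())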
print(df)
seqnames start end name number strand
0 A 51 451 A 1 -
1 B 31 320 A 2 -
2 C 11 430 A 3 -
3 D 36 344 A 4 +
4 E 40 321 A 5 +
5 F 79 232 A 6 -
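The reason for reshaping the mask with [:, None] is NumPy broadcasting: a mask of shape (n, 1) is stretched across both columns, so each row is taken whole from either the shifted or the original values. A minimal sketch on a two-row frame (the variable names cond, shifted and result are only for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "start": [50, 36],
    "end": [453, 344],
    "strand": ["-", "+"],
})

# Row mask as a NumPy column vector: shape (2, 1) instead of (2,).
cond = df["strand"].eq("-").to_numpy()[:, None]

# Candidate values for every row, shape (2, 2).
shifted = np.column_stack((df["start"] + 1, df["end"] - 2))

# The (2, 1) mask broadcasts across both columns, so each row comes
# entirely from `shifted` or entirely from the original values.
result = np.where(cond, shifted, df[["start", "end"]].to_numpy())
```

With a flat mask of shape (n,) the broadcast against an (n, 2) array would generally fail or pair up the wrong axes, which is why the extra axis is added.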
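Use Series.mask and assign the result back (calling mask with inplace=True on a selected column is chained assignment and does not update df when copy-on-write is enabled):
df['start'] = df['start'].mask(df['strand']=='-', df['start']+1)
df['end'] = df['end'].mask(df['strand']=='-', df['end']-2)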
print(df)
seqnames start end name number strand
0 A 51 451 A 1 -
1 B 31 320 A 2 -
2 C 11 430 A 3 -
3 D 36 344 A 4 +
4 E 40 321 A 5 +
5 F 79 232 A 6 -
You can also use DataFrame.apply + DataFrame.where (row-wise apply is slower than the vectorized options):
df[['start','end']]=( df[['start','end']]
.apply(lambda x: pd.Series((x['start']+1,x['end']-2)).rename({0:'start',1:'end'}),axis=1)
.where(df['strand']=='-',df[['start','end']])
)
print(df)
seqnames start end name number strand
0 A 51 451 A 1 -
1 B 31 320 A 2 -
2 C 11 430 A 3 -
3 D 36 344 A 4 +
4 E 40 321 A 5 +
5 F 79 232 A 6 -
Use DataFrame.loc:
df.loc[ df['strand'] == '-', ['start', 'end']] += [1, -2]
print (df)
seqnames start end name number strand
0 A 51 451 A 1 -
1 B 31 320 A 2 -
2 C 11 430 A 3 -
3 D 36 344 A 4 +
4 E 40 321 A 5 +
5 F 79 232 A 6 -
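If the original frame should stay untouched, the same loc idea can be wrapped in a small function that works on a copy. This is only a sketch: shift_minus_strand and its start_delta / end_delta parameters are made-up names, not part of pandas or of the answer above.

```python
import pandas as pd


def shift_minus_strand(df, start_delta=1, end_delta=-2):
    """Return a copy of df with start/end shifted on '-' strand rows.

    Hypothetical helper for illustration only.
    """
    out = df.copy()
    minus = out["strand"] == "-"
    # The two-element list is matched against the two selected columns.
    out.loc[minus, ["start", "end"]] += [start_delta, end_delta]
    return out


df = pd.DataFrame({
    "seqnames": list("ABCDEF"),
    "start": [50, 30, 10, 36, 40, 78],
    "end": [453, 322, 432, 344, 321, 234],
    "strand": ["-", "-", "-", "+", "+", "-"],
})

result = shift_minus_strand(df)
```

Here result holds the shifted coordinates while df keeps the original ones, which can be handy when the raw intervals are needed again later.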
Or use numpy.where to add or subtract values:
m = df['strand'] == '-'
df['start'] = df['start'] + np.where(m, 1 ,0)
df['end'] = df['end'] - np.where(m, 2, 0)
Or convert the mask to integers and, for the second column, multiply by 2:
m = df['strand'] == '-'
df['start'] = df['start'] + m.astype(int)
df['end'] = df['end'] - m.astype(int) * 2
print (df)
seqnames start end name number strand
0 A 51 451 A 1 -
1 B 31 320 A 2 -
2 C 11 430 A 3 -
3 D 36 344 A 4 +
4 E 40 321 A 5 +
5 F 79 232 A 6 -
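The integer trick works because a boolean mask cast to int is 1 where the condition holds and 0 elsewhere, so adding it only changes the matching rows. A small sketch on standalone Series (the names strand, step, new_start and new_end are illustrative):

```python
import pandas as pd

strand = pd.Series(["-", "+", "-"])
start = pd.Series([50, 36, 78])
end = pd.Series([453, 344, 234])

m = strand == "-"
step = m.astype(int)        # 1 on "-" rows, 0 on "+" rows

new_start = start + step    # +1 only where strand is "-"
new_end = end - 2 * step    # -2 only where strand is "-"
```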
Another option is one assignment per column, using loc for indexing:
df.loc[df['strand'] == '-', 'start'] = df.loc[df['strand'] == '-', 'start'] + 1
df.loc[df['strand'] == '-', 'end'] = df.loc[df['strand'] == '-', 'end'] - 2