I have a Python/pandas dataframe (df1) consisting of an ID, a Chr and a position, and a second dataframe (df2) holding the same kind of data (an ID, a Chr and a position range).
I would like to obtain a third dataframe (df3) that keeps only the rows of df1 whose Chr matches a row of df2 and whose pos lies between that row's pos-start and pos-end; in addition, each kept row needs the ID of the df2 row that produced the match.
I found this very difficult. Does anyone have an idea?
Please see the examples below:
df1 :
ID1 Chr pos
a 12 500
b 12 250
c 12 300
d 16 2000
e 16 1050
f 16 1075
d 16 1150
g 17 8000
h 17 550
i 17 500
df2 :
ID2 Chr pos-start pos-end
x 12 200 400
y 16 1000 1100
z 16 1070 1200
resulting df3 :
ID2 ID1 Chr Pos
x b 12 250
x c 12 300
y e 16 1050
y f 16 1075
z f 16 1075
z d 16 1150
One way is to do a plain old merge, which joins on the shared Chr column, then throw away the rows whose pos is out of range:
In [11]: df3 = df1.merge(df2)
In [12]: df3
Out[12]:
ID1 Chr pos ID2 pos-start pos-end
0 a 12 500 x 200 400
1 b 12 250 x 200 400
2 c 12 300 x 200 400
3 d 16 2000 y 1000 1100
4 d 16 2000 z 1070 1200
5 e 16 1050 y 1000 1100
6 e 16 1050 z 1070 1200
7 f 16 1075 y 1000 1100
8 f 16 1075 z 1070 1200
9 d 16 1150 y 1000 1100
10 d 16 1150 z 1070 1200
In [13]: df3[(df3["pos-start"] < df3["pos"]) & (df3["pos"] < df3["pos-end"])]
Out[13]:
ID1 Chr pos ID2 pos-start pos-end
1 b 12 250 x 200 400
2 c 12 300 x 200 400
5 e 16 1050 y 1000 1100
7 f 16 1075 y 1000 1100
8 f 16 1075 z 1070 1200
10 d 16 1150 z 1070 1200
Then discard the columns you don't want:
In [14]: df3[(df3["pos-start"] < df3["pos"]) & (df3["pos"] < df3["pos-end"])][['ID2', 'ID1', 'Chr', 'pos']]
Out[14]:
ID2 ID1 Chr pos
1 x b 12 250
2 x c 12 300
5 y e 16 1050
7 y f 16 1075
8 z f 16 1075
10 z d 16 1150
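For anyone who wants to run this outside an interactive session, the steps above can be sketched as a single script. The dataframes are rebuilt from the tables in the question; the names `merged` and `in_range`, the explicit `on="Chr"`, and the final sort and index reset are my own additions for readability, not part of the session above.

```python
import pandas as pd

# Sample data copied from the question.
df1 = pd.DataFrame({
    "ID1": list("abcdefdghi"),
    "Chr": [12, 12, 12, 16, 16, 16, 16, 17, 17, 17],
    "pos": [500, 250, 300, 2000, 1050, 1075, 1150, 8000, 550, 500],
})
df2 = pd.DataFrame({
    "ID2": ["x", "y", "z"],
    "Chr": [12, 16, 16],
    "pos-start": [200, 1000, 1070],
    "pos-end": [400, 1100, 1200],
})

# Inner merge on Chr: every df1 row is paired with every df2 row
# on the same chromosome (Chr 17 has no partner and drops out).
merged = df1.merge(df2, on="Chr")

# Keep only the pairs where pos lies strictly inside the interval.
in_range = (merged["pos-start"] < merged["pos"]) & (merged["pos"] < merged["pos-end"])

# Select the wanted columns; sorting and re-indexing are optional tidying.
df3 = (
    merged.loc[in_range, ["ID2", "ID1", "Chr", "pos"]]
    .sort_values(["ID2", "pos"])
    .reset_index(drop=True)
)
print(df3)
```

If a position exactly on pos-start or pos-end should also count as a match, change `<` to `<=` in the mask. Note that the merge builds every same-Chr pair before filtering, so the intermediate frame can get large when both inputs have many rows per chromosome.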