Merging pandas dataframes on multiple conditions (python/pandas)

I have a Python/Pandas dataframe ( df1 ) consisting of an ID, a chromosome (Chr) and a position, and a second dataframe ( df2 ) holding similar data: an ID, a Chr and a position range (pos-start, pos-end).

I would like to obtain a third dataframe ( df3 ) that keeps only the rows of df1 whose Chr matches a row of df2 and whose pos lies between that row's pos-start and pos-end. Each kept row should also carry the ID of the df2 row that produced the match.

I found this to be very difficult. Does anyone have an idea?

Please see the examples below:

df1 :

ID1 Chr pos
a   12  500
b   12  250
c   12  300
d   16  2000
e   16  1050
f   16  1075
d   16  1150
g   17  8000
h   17  550
i   17  500

df2 :

ID2 Chr pos-start   pos-end
x   12  200      400
y   16  1000    1100
z   16  1070    1200

resulting df3 :

ID2 ID1 Chr Pos
x   b   12  250
x   c   12  300
y   e   16  1050
y   f   16  1075
z   f   16  1075
z   d   16  1150

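One way is to do a plain merge, which joins on the one column the two frames share (Chr), and then throw away the rows whose pos falls outside the range. Rows of df1 with a Chr that does not appear in df2 (here, Chr 17) are already dropped by the merge: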

In [11]: df3 = df1.merge(df2)

In [12]: df3
Out[12]:
   ID1  Chr   pos ID2  pos-start  pos-end
0    a   12   500   x        200      400
1    b   12   250   x        200      400
2    c   12   300   x        200      400
3    d   16  2000   y       1000     1100
4    d   16  2000   z       1070     1200
5    e   16  1050   y       1000     1100
6    e   16  1050   z       1070     1200
7    f   16  1075   y       1000     1100
8    f   16  1075   z       1070     1200
9    d   16  1150   y       1000     1100
10   d   16  1150   z       1070     1200

In [13]: df3[(df3["pos-start"] < df3["pos"]) & (df3["pos"] < df3["pos-end"])]
Out[13]:
   ID1  Chr   pos ID2  pos-start  pos-end
1    b   12   250   x        200      400
2    c   12   300   x        200      400
5    e   16  1050   y       1000     1100
7    f   16  1075   y       1000     1100
8    f   16  1075   z       1070     1200
10   d   16  1150   z       1070     1200

Then keep only the columns you want:

In [14]: df3[(df3["pos-start"] < df3["pos"]) & (df3["pos"] < df3["pos-end"])][['ID2', 'ID1', 'Chr', 'pos']]
Out[14]:
   ID2 ID1  Chr   pos
1    x   b   12   250
2    x   c   12   300
5    y   e   16  1050
7    y   f   16  1075
8    z   f   16  1075
10   z   d   16  1150
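
For reference, the steps above can be put together as one self-contained script. This is only a sketch under a few assumptions: the helper name `match_positions` and its `inclusive` flag are made up for illustration and are not part of pandas, and the frames are rebuilt by hand from the example tables in the question. The merge is written with an explicit `on="Chr"` so that it does not depend on which column names the two frames happen to share.

```python
import pandas as pd

# Example data copied from the question.
df1 = pd.DataFrame({
    "ID1": ["a", "b", "c", "d", "e", "f", "d", "g", "h", "i"],
    "Chr": [12, 12, 12, 16, 16, 16, 16, 17, 17, 17],
    "pos": [500, 250, 300, 2000, 1050, 1075, 1150, 8000, 550, 500],
})
df2 = pd.DataFrame({
    "ID2": ["x", "y", "z"],
    "Chr": [12, 16, 16],
    "pos-start": [200, 1000, 1070],
    "pos-end": [400, 1100, 1200],
})


def match_positions(positions, ranges, inclusive=False):
    """Hypothetical helper: pair each position with every range on the
    same Chr, then keep only the pairs where pos lies inside the range."""
    # Inner join on Chr: every df1 row is paired with every df2 row
    # that has the same chromosome.
    merged = positions.merge(ranges, on="Chr")

    if inclusive:
        # Treat pos-start and pos-end themselves as inside the range.
        in_range = (merged["pos-start"] <= merged["pos"]) & (
            merged["pos"] <= merged["pos-end"]
        )
    else:
        # Strict comparison, as in the answer above.
        in_range = (merged["pos-start"] < merged["pos"]) & (
            merged["pos"] < merged["pos-end"]
        )

    # Keep the matching rows and only the columns of interest.
    return merged.loc[in_range, ["ID2", "ID1", "Chr", "pos"]].reset_index(drop=True)


df3 = match_positions(df1, df2)
print(df3)
```

With the example data this gives the same six rows as Out[14]. Two things to keep in mind: the comparison is strict by default, so a pos exactly equal to pos-start or pos-end is dropped unless `inclusive=True` is passed; and the intermediate merge holds one row per (position, range) pair on the same Chr, so it can get large when a chromosome has many positions and many ranges.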
