
Pandas: Finding overlapping regions based on start- and stop coordinates

I want to identify rows with start- and stop positions that overlap with the start- and stop position of other rows. There are a few restrictions that apply:

  1. The rows I want to check are a subset of the entire data set
  2. The rows in the subset should be compared to the entire data set
  3. The rows should not be compared with themselves

Below is a minimal representation of the data set:

   id type  start  stop
0   1   AP      0    10
1   2   AP      3     7
2   3   ES      5    15
3   4   ES     12    18

Here's an image that describes the problem better. Each box represents an event/row, and the number in it is the row's ID: [image: overlapping exons]

And here's my desired output:

   id type  start  stop  number_of_overlapping_exons
0   1   AP      0    10                            2
1   2   AP      3     7                            2

I want to find the rows with type equal to AP that have other rows (of any type) that overlap their position. In the image above, the blue boxes represent AP events. There are two events/rows overlapping blue box 1 (boxes 2 and 3), so the number_of_overlapping_exons for ID 1 should be 2. Blue box 2 also has two overlapping events (boxes 1 and 3). Here's what I've got so far:

import pandas as pd

# Sample input
df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "type": ["AP", "AP", "ES", "ES"],
    "start": [0, 3, 5, 12],
    "stop": [10, 7, 15, 18]
})

# Extract only AP events
ap = df.loc[df.type == "AP"]

# Find events that overlap start positions in "ap"
# by identifying "start" or "stop" positions in "df"
# that are greater or equal to "start" positions in "ap".
overlapping_start_positions = df.loc[(df.start >= ap.start) | (df.stop >= ap.start)]
# Find events that overlap stop positions in "ap"
# by identifying "start" or "stop" positions in "df"
# that are smaller or equal to "stop" positions in "ap".
overlapping_stop_positions = df.loc[(df.start <= ap.stop) | (df.stop <= ap.stop)]

I'm getting a ValueError when doing overlapping_start_positions saying

ValueError: Can only compare identically-labeled Series objects
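
For context, the error can be reproduced in isolation. pandas only compares two Series element-wise when their indexes carry the same labels, and here df has four rows while ap has two. The snippet below is a minimal sketch of that behaviour on the sample data; the names raised and mask_for_first_ap are illustrative only and not part of the original code.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "type": ["AP", "AP", "ES", "ES"],
    "start": [0, 3, 5, 12],
    "stop": [10, 7, 15, 18],
})
ap = df.loc[df.type == "AP"]

# df.start is labeled [0, 1, 2, 3] while ap.start is labeled [0, 1].
# Comparison operators between two Series require identical labels.
try:
    df.start >= ap.start
    raised = False
except ValueError:
    raised = True

# Comparing a whole column against a single scalar needs no alignment,
# so one AP row at a time can be tested against every row in df.
first_start = ap.start.iloc[0]
first_stop = ap.stop.iloc[0]
mask_for_first_ap = (df.start <= first_stop) & (df.stop >= first_start)
```

Comparing against one scalar at a time is what makes a row-by-row loop work where the Series-to-Series comparison fails.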

EDIT:

Come to think of it, condition 3:

  3. The rows should not be compared with themselves

is not really required. All events will overlap with themselves, so I can just subtract 1 from number_of_overlapping_exons.

I think there is a clever way to do this in one pass, but a brute force solution is to just loop over the rows in the dataframe.

For example:

import pandas as pd

# Sample input
df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "type": ["AP", "AP", "ES", "ES"],
    "start": [0, 3, 5, 12],
    "stop": [10, 7, 15, 18]
})
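df['count'] = 0

for row in df.itertuples():
    # Two closed intervals overlap when each one starts before the other ends.
    mask = (row.start <= df.stop) & (row.stop >= df.start)
    # Every row overlaps itself, so subtract 1 from the number of matches.
    df.loc[row.Index, 'count'] = mask.sum() - 1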

And we get

   id  start  stop type  count
0   1      0    10   AP      2
1   2      3     7   AP      2
2   3      5    15   ES      3
3   4     12    18   ES      1
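
To get closer to the desired output from the question (AP rows only, compared against the entire data set), the same overlap test can be written without an explicit loop. This is a sketch using NumPy broadcasting, not the approach from the answer above. It assumes each AP row appears exactly once in df (so subtracting 1 removes the self-match) and that the data is small enough for an n_ap x n boolean matrix to fit in memory. The names ap_start, ap_stop, all_start, all_stop and overlaps are made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "type": ["AP", "AP", "ES", "ES"],
    "start": [0, 3, 5, 12],
    "stop": [10, 7, 15, 18],
})

# Only the AP rows are checked, but they are compared against all of df.
ap = df.loc[df["type"] == "AP"].copy()

# Column vectors for the AP rows, row vectors for every row in df.
ap_start = ap["start"].to_numpy()[:, None]
ap_stop = ap["stop"].to_numpy()[:, None]
all_start = df["start"].to_numpy()[None, :]
all_stop = df["stop"].to_numpy()[None, :]

# overlaps[i, j] is True when AP row i and df row j share a position
# (both ends inclusive, same test as in the loop version).
overlaps = (ap_start <= all_stop) & (ap_stop >= all_start)

# Each AP row also matches itself, so subtract 1 from its row total.
ap["number_of_overlapping_exons"] = overlaps.sum(axis=1) - 1

# Keep only AP rows that overlap at least one other row.
ap = ap.loc[ap["number_of_overlapping_exons"] > 0]
```

On the sample data this keeps IDs 1 and 2, each with number_of_overlapping_exons equal to 2. For large inputs the pairwise matrix grows quickly, so the loop version or a sort-based approach may be the safer choice.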
