I want to identify rows whose start and stop positions overlap with the start and stop positions of other rows. There are a few restrictions that apply:
Below is a minimal representation of the data set:
   id type  start  stop
0   1   AP      0    10
1   2   AP      3     7
2   3   ES      5    15
3   4   ES     12    18
Here's an image that describes the problem better. Each box represents an event/row, and the number in it is the row's ID.
And here's my desired output:
   id type  start  stop  number_of_overlapping_exons
0   1   AP      0    10                            2
1   2   AP      3     7                            2
I want to find the rows with type equal to AP that have other rows (of any type) that overlap their position. In the image above, the blue boxes represent AP events. There are two events/rows overlapping blue box 1 (boxes 2 and 3), so the number_of_overlapping_exons for ID 1 should be 2. Blue box 2 also has two overlapping events (boxes 1 and 3).
Here's what I've got so far:
import pandas as pd
# Sample input
df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "type": ["AP", "AP", "ES", "ES"],
    "start": [0, 3, 5, 12],
    "stop": [10, 7, 15, 18]
})
# Extract only AP events
ap = df.loc[df.type == "AP"]
# Find events that overlap start positions in "ap"
# by identifying "start" or "stop" positions in "df"
# that are greater or equal to "start" positions in "ap".
overlapping_start_positions = df.loc[(df.start >= ap.start) | (df.stop >= ap.start)]
# Find events that overlap stop positions in "ap"
# by identifying "start" or "stop" positions in "df"
# that are smaller or equal to "stop" positions in "ap".
overlapping_stop_positions = df.loc[(df.start <= ap.stop) | (df.stop <= ap.stop)]
I'm getting a ValueError when computing overlapping_start_positions:
ValueError: Can only compare identically-labeled Series objects
EDIT:
Come to think of it, condition 3:
is not really required. All events will overlap with themselves, so I can just subtract 1 from number_of_overlapping_exons.
I think there is a clever way to do this in one pass, but a brute-force solution is to just loop over the rows in the DataFrame.
For example:
import pandas as pd
# Sample input
df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "type": ["AP", "AP", "ES", "ES"],
    "start": [0, 3, 5, 12],
    "stop": [10, 7, 15, 18]
})
df['count'] = 0
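for row in df.itertuples():
    # Two closed intervals overlap when each one starts before the other ends.
    mask = (row.start <= df.stop) & (row.stop >= df.start)
    # Every row overlaps with itself, so subtract 1.
    df.loc[row.Index, 'count'] = mask.sum() - 1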
And we get:
   id  start  stop type  count
0   1      0    10   AP      2
1   2      3     7   AP      2
2   3      5    15   ES      3
3   4     12    18   ES      1
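If the loop turns out to be too slow, one possible alternative is to compare every row against every other row at once with NumPy broadcasting. This is only a sketch, and not necessarily the clever one-pass approach mentioned above: it builds an n-by-n boolean matrix, so memory grows quadratically with the number of rows. The names overlaps and result are made up for illustration; the column name number_of_overlapping_exons and the final filter on AP rows follow the desired output in the question.

```python
import numpy as np
import pandas as pd

# Same sample input as above
df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "type": ["AP", "AP", "ES", "ES"],
    "start": [0, 3, 5, 12],
    "stop": [10, 7, 15, 18],
})

start = df["start"].to_numpy()
stop = df["stop"].to_numpy()

# overlaps[i, j] is True when row i and row j share at least one position.
# This is the same closed-interval test as in the loop, applied to all pairs.
overlaps = (start[:, None] <= stop[None, :]) & (stop[:, None] >= start[None, :])

# Every row overlaps with itself, so subtract 1 from each row's total.
df["number_of_overlapping_exons"] = overlaps.sum(axis=1) - 1

# Keep only the AP events, as in the desired output.
result = df.loc[df["type"] == "AP"]
print(result)
```

On the sample data this gives the same counts as the loop (2, 2, 3, 1), and result contains only the two AP rows.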