In R (albeit longwinded):
Here is a test data.frame
df <- data.frame(
"CHR" = c(1,1,1,2,2),
"START" = c(100, 200, 300, 100, 400),
"STOP" = c(150,350,400,500,450)
)
First I make GRanges object:
gr <- GenomicRanges::GRanges(
seqnames = df$CHR,
ranges = IRanges(start = df$START, end = df$STOP)
)
Then I reduce the intervals to collapse into new granges object:
reduced <- reduce(gr)
Now append a new column to original dataframe which confirms which rows belong to the same contiguous 'chunk'.
subjectHits(findOverlaps(gr, reduced))
Output:
> df
CHR START STOP locus
1 1 100 150 1
2 1 200 350 2
3 1 300 400 2
4 2 100 500 3
5 2 400 450 3
How do I do this in Python? I am aware of pybedtools, but to my knowledge, this would require me to save my data.frame to disk. Any help appreciated.
https://github.com/biocore-ntnu/pyranges
import pyranges as pr
chromosomes = [1] * 3 + [2] * 2
starts = [100, 200, 300, 100, 400]
ends = [150, 350, 400, 500, 450]
gr = pr.PyRanges(chromosomes=chromosomes, starts=starts, ends=ends)
gr.cluster()
# +--------------+-----------+-----------+-----------+
# | Chromosome | Start | End | Cluster |
# | (int8) | (int32) | (int32) | (int64) |
# |--------------+-----------+-----------+-----------|
# | 1 | 100 | 150 | 1 |
# | 1 | 200 | 350 | 2 |
# | 1 | 300 | 400 | 2 |
# | 2 | 100 | 500 | 3 |
# | 2 | 400 | 450 | 3 |
# +--------------+-----------+-----------+-----------+
It will be out in 0.0.21. Thanks for the idea!
It appears you are trying to get the intersection of these. Pybedtools will accept streams as an input. Read your data into a string that is in bed format.
"chr,start,stop"
I start with a python dictionary and loop through it.
bed_string += "{0} {1} {2} {3} {0}|{1}|{2}|{3}\n".format(chrom, coord_start, coord_stop, aberration)
# Now create your bedtools.
breakpoint_bedtool = pybedtools.BedTool(bed_string, from_string=True)
target_bedtool = pybedtools.BedTool(self.args.Target_Bed_File, from_string=False)
# Find target intersects for printing.
breakpoint_target_intersect = breakpoint_bedtool.intersect(target_bedtool, wb=True, stream=True)
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.