
Pythonic equivalent to reduce() in R GRanges - how to collapse ranged data?

In R, I do the following (albeit long-winded).

Here is a test data.frame:

df <- data.frame(
  "CHR" = c(1,1,1,2,2),
  "START" = c(100, 200, 300, 100, 400),
  "STOP" = c(150,350,400,500,450)
  )

First I make a GRanges object:

library(GenomicRanges)  # also attaches IRanges, which provides IRanges()

gr <- GenomicRanges::GRanges(
  seqnames = df$CHR,
  ranges = IRanges(start = df$START, end = df$STOP)
  )

Then I reduce the intervals, collapsing overlapping ranges into a new GRanges object:

reduced <- reduce(gr)

Now I append a new column to the original data.frame that records which rows belong to the same contiguous 'chunk':

df$locus <- subjectHits(findOverlaps(gr, reduced))

Output:

> df
  CHR START STOP locus
1   1   100  150     1
2   1   200  350     2
3   1   300  400     2
4   2   100  500     3
5   2   400  450     3

How do I do this in Python? I am aware of pybedtools, but to my knowledge, this would require me to save my data.frame to disk. Any help appreciated.

You can do this with pyranges: https://github.com/biocore-ntnu/pyranges

import pyranges as pr
chromosomes = [1] * 3 + [2] * 2
starts = [100, 200, 300, 100, 400]
ends = [150, 350, 400, 500, 450]
gr = pr.PyRanges(chromosomes=chromosomes, starts=starts, ends=ends)
gr.cluster()
# +--------------+-----------+-----------+-----------+
# |   Chromosome |     Start |       End |   Cluster |
# |       (int8) |   (int32) |   (int32) |   (int64) |
# |--------------+-----------+-----------+-----------|
# |            1 |       100 |       150 |         1 |
# |            1 |       200 |       350 |         2 |
# |            1 |       300 |       400 |         2 |
# |            2 |       100 |       500 |         3 |
# |            2 |       400 |       450 |         3 |
# +--------------+-----------+-----------+-----------+

The cluster() method will be available in pyranges 0.0.21. Thanks for the idea!
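If installing pyranges is not an option, the same labelling can be sketched in plain Python. This is only a minimal sketch, not how pyranges or GenomicRanges implement it: `label_loci` is a hypothetical helper name, intervals are assumed to be closed (a row joins the current locus when its start is less than or equal to the furthest stop seen so far on that chromosome), and merely adjacent intervals are not merged, which may differ from how reduce() or cluster() treat them.

```python
def label_loci(chroms, starts, stops):
    """Return a 1-based locus id per input row; overlapping rows share an id.

    Assumption: closed intervals, and only overlapping (not merely adjacent)
    intervals on the same chromosome are grouped together.
    """
    # Visit rows sorted by chromosome, then start, but remember original positions.
    order = sorted(range(len(chroms)), key=lambda i: (chroms[i], starts[i], stops[i]))
    loci = [0] * len(chroms)
    locus = 0
    current_chrom = None
    max_stop = None  # furthest stop seen so far in the current locus
    for i in order:
        if chroms[i] != current_chrom or starts[i] > max_stop:
            # New chromosome, or a gap before this interval: open a new locus.
            locus += 1
            current_chrom = chroms[i]
            max_stop = stops[i]
        else:
            # Overlaps the current locus: extend it if this interval reaches further.
            max_stop = max(max_stop, stops[i])
        loci[i] = locus
    return loci


loci = label_loci(
    [1, 1, 1, 2, 2],
    [100, 200, 300, 100, 400],
    [150, 350, 400, 500, 450],
)
print(loci)  # [1, 2, 2, 3, 3]
```

The returned list is in the original row order, so it can be assigned directly as a new column of a pandas DataFrame holding the same rows.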

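It appears you are trying to get the intersection of these intervals. pybedtools accepts streams and strings as input, so you do not need to save anything to disk. Read your data into a string in BED format, one interval per line, with the fields in this order:

"chr,start,stop"

I start with a Python dictionary and loop through it, appending one line per interval: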

import pybedtools

bed_string = ""
# Inside the loop over the dictionary, append one BED line per interval.
bed_string += "{0} {1} {2} {3} {0}|{1}|{2}|{3}\n".format(chrom, coord_start, coord_stop, aberration)
# Now create the BedTool objects: one from the in-memory string, one from a BED file on disk.
breakpoint_bedtool = pybedtools.BedTool(bed_string, from_string=True)
target_bedtool = pybedtools.BedTool(self.args.Target_Bed_File, from_string=False)
# Find target intersects for printing.
breakpoint_target_intersect = breakpoint_bedtool.intersect(target_bedtool, wb=True, stream=True)
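The string-building loop is not shown above, so here is a rough sketch of what it could look like. The dictionary layout (chromosome mapped to a list of start, stop, label tuples) and the helper name `to_bed_string` are assumptions for illustration, not the layout used in the answer. The sketch writes tab-separated fields and uses no pybedtools calls, so it only covers preparing the text.

```python
# Hypothetical input: chromosome -> list of (start, stop, label) tuples.
records = {
    "chr1": [(100, 150, "a"), (200, 350, "b")],
    "chr2": [(100, 500, "c")],
}


def to_bed_string(records):
    """Build a tab-separated BED-style string with one interval per line."""
    lines = []
    for chrom, intervals in records.items():
        for start, stop, label in intervals:
            lines.append(f"{chrom}\t{start}\t{stop}\t{label}")
    # End with a newline so the last record is terminated like the others.
    return "\n".join(lines) + "\n"


bed_string = to_bed_string(records)
print(bed_string, end="")
```

Note that BED coordinates are 0-based with an exclusive end, so 1-based inclusive coordinates such as those in the R example would need their start shifted by one before being written this way.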

