Pythonic equivalent to reduce() in R GRanges - how to collapse ranged data?

In R (albeit long-winded):

Here is a test data.frame:

df <- data.frame(
  "CHR" = c(1,1,1,2,2),
  "START" = c(100, 200, 300, 100, 400),
  "STOP" = c(150,350,400,500,450)
  )

First I make a GRanges object:

library(GenomicRanges)  # also attaches IRanges, which provides IRanges()

gr <- GRanges(
  seqnames = df$CHR,
  ranges = IRanges(start = df$START, end = df$STOP)
  )

Then I reduce the intervals, collapsing them into a new GRanges object:

reduced <- reduce(gr)

Now I append a new column to the original data.frame that records which rows belong to the same contiguous 'chunk':

df$locus <- subjectHits(findOverlaps(gr, reduced))

Output:

> df
  CHR START STOP locus
1   1   100  150     1
2   1   200  350     2
3   1   300  400     2
4   2   100  500     3
5   2   400  450     3

How do I do this in Python? I am aware of pybedtools, but to my knowledge, this would require me to save my data.frame to disk. Any help appreciated.

https://github.com/biocore-ntnu/pyranges

import pyranges as pr
chromosomes = [1] * 3 + [2] * 2
starts = [100, 200, 300, 100, 400]
ends = [150, 350, 400, 500, 450]
gr = pr.PyRanges(chromosomes=chromosomes, starts=starts, ends=ends)
gr.cluster()
# +--------------+-----------+-----------+-----------+
# |   Chromosome |     Start |       End |   Cluster |
# |       (int8) |   (int32) |   (int32) |   (int64) |
# |--------------+-----------+-----------+-----------|
# |            1 |       100 |       150 |         1 |
# |            1 |       200 |       350 |         2 |
# |            1 |       300 |       400 |         2 |
# |            2 |       100 |       500 |         3 |
# |            2 |       400 |       450 |         3 |
# +--------------+-----------+-----------+-----------+

It will be out in 0.0.21. Thanks for the idea!
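If adding a dependency is not an option, the same grouping can be sketched in plain pandas. This is only a minimal sketch and not how pyranges implements cluster(): the function name assign_clusters is hypothetical, the column names are taken from the question's data.frame, and the coordinates are treated as closed intervals. Ranges that merely touch (end + 1 == next start) stay in separate loci here, whereas reduce() in R, as far as I know, also merges directly adjacent ranges by default.

```python
import pandas as pd


def assign_clusters(df, chrom="CHR", start="START", stop="STOP"):
    """Return a copy of df with a 1-based 'locus' column.

    Rows on the same chromosome whose ranges overlap, directly or
    through a chain of overlaps, receive the same locus number.
    """
    out = df.sort_values([chrom, start, stop]).copy()
    # Furthest end seen among the earlier rows of the same chromosome
    # (NaN for the first row of each chromosome).
    prev_max_end = out.groupby(chrom)[stop].transform(
        lambda s: s.cummax().shift()
    )
    # A new locus begins at the first row of a chromosome or after a gap.
    is_new = prev_max_end.isna() | (out[start] > prev_max_end)
    out["locus"] = is_new.cumsum()
    # Restore the original row order.
    return out.sort_index()


df = pd.DataFrame(
    {
        "CHR": [1, 1, 1, 2, 2],
        "START": [100, 200, 300, 100, 400],
        "STOP": [150, 350, 400, 500, 450],
    }
)

clustered = assign_clusters(df)

# The collapsed ranges themselves, one row per locus.
reduced = clustered.groupby(["CHR", "locus"], as_index=False).agg(
    START=("START", "min"), STOP=("STOP", "max")
)
```

Here clustered carries the per-row locus labels, and reduced holds one collapsed range per locus, comparable to what reduce() returns in the question.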

It appears you are trying to get the intersection of these. Pybedtools will accept streams as input. Read your data into a string that is in BED format.

"chr,start,stop"

I start with a Python dictionary and loop through it.

import pybedtools

bed_string = ""
# The next line runs inside the loop over the dictionary, once per interval.
bed_string += "{0} {1} {2} {3} {0}|{1}|{2}|{3}\n".format(chrom, coord_start, coord_stop, aberration)
# Now create your bedtools.
breakpoint_bedtool = pybedtools.BedTool(bed_string, from_string=True)
target_bedtool = pybedtools.BedTool(self.args.Target_Bed_File, from_string=False)
# Find target intersects for printing.
breakpoint_target_intersect = breakpoint_bedtool.intersect(target_bedtool, wb=True, stream=True)
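The loop that builds bed_string is not shown in the snippet. A minimal sketch of what it might look like follows, assuming a hypothetical dictionary named breakpoints whose values hold the chromosome, the coordinates, and an aberration label. The dictionary layout, its keys, and the helper build_bed_string are all illustrative assumptions, not part of the original answer. Space-separated fields should be enough, since pybedtools, as far as I know, converts spaces to tabs when from_string=True.

```python
# Hypothetical input: one entry per interval, keyed by an arbitrary ID.
breakpoints = {
    "bp1": {"chrom": "chr1", "start": 100, "stop": 150, "aberration": "DEL"},
    "bp2": {"chrom": "chr2", "start": 400, "stop": 450, "aberration": "DUP"},
}


def build_bed_string(records):
    """Format each record as one space-separated BED line."""
    lines = []
    for rec in records.values():
        lines.append(
            "{0} {1} {2} {3} {0}|{1}|{2}|{3}\n".format(
                rec["chrom"], rec["start"], rec["stop"], rec["aberration"]
            )
        )
    return "".join(lines)


bed_string = build_bed_string(breakpoints)
```

The resulting bed_string can then be passed to pybedtools.BedTool(bed_string, from_string=True), so nothing has to be written to disk by hand.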
