Unable to reproduce R data.table::foverlaps in Python

Question

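I use data.table::foverlaps to solve overlapping genomic interval problems. I recently started looking for an equivalent of foverlaps in Python, because it would be much nicer to use a single language instead of combining Python and R every time I have to dig into analysis output. Of course, I am not the first to ask about an equivalent of R's foverlaps for pandas in Python. These are the most relevant posts I found on SO: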

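2015 Merge pandas dataframes where one value is between two others

2016 R foverlaps equivalent in Python

2017 How to join two dataframes for which column values are within a certain range?

2018 How to reproduce the same output of foverlaps in R with merge of pandas in Python?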

The thing is, I am not a Python expert at all. So I picked the answer that was the most relevant and understandable to me, the sqlite3 one.

This is how I do it in R:

library(data.table)

interv1 <- cbind(seq(from = 3, to = 40, by = 4),seq(from = 5, to = 50, by = 5), c(rep("blue",5), rep("red", 5)), rep("+",10))
interv2 <- cbind(seq(from = 3, to = 40, by = 4),seq(from = 5, to = 50, by = 5), c(rep("blue",5), rep("red", 5)), rep("-",10))
interv  <- rbind(interv1, interv2)
interv <- data.table(interv)
colnames(interv) <- c('start', 'stop', 'color', 'strand')
interv$start <- as.integer(interv$start)
interv$stop <- as.integer(interv$stop)
interv$stop <- interv$stop -1
interv$cov <- runif(n=nrow(interv), min = 10, max = 200)

to_match <- data.table(cbind(rep(seq(from = 4, to = 43, by = 4),2), rep(c(rep("blue", 5), rep("red", 5)), 2), c(rep("-", 10), rep("+", 10))))
colnames(to_match) <- c('start', 'color', 'strand')
to_match$stop <-  to_match$start 
to_match$start <- as.integer(to_match$start)
to_match$stop <- as.integer(to_match$stop)

setkey(interv, color, strand, start, stop)
setkey(to_match, color, strand, start, stop)

overlapping_df <- foverlaps(to_match,interv)

#write.csv(x = interv, file = "Documents/script/SO/wig_foverlaps_test.txt", row.names = F)
#write.csv(x = to_match, file = "Documents/script/SO/cov_foverlaps_test.txt", row.names = F)

And this is how I tried to reproduce it in Python:

import pandas as pd
import sqlite3

cov_table = pd.DataFrame(pd.read_csv('SO/cov_foverlaps_test.txt', skiprows = [0], header=None))
cov_table.columns = ['start', 'stop', 'chrm', 'strand', 'cov']
cov_table.stop = cov_table.stop - 1


wig_file = pd.DataFrame(pd.read_csv('SO/wig_foverlaps_test.txt', header=None, skiprows = [0]))
wig_file.columns = ['i_start', 'chrm', 'i_strand', 'i_stop']

cov_cols = ['start','stop','chrm','strand','cov']
fract_cols = ['i_start','i_stop','chrm','i_strand']

cov_table = cov_table.reindex(columns = cov_cols)
wig_file = wig_file.reindex(columns = fract_cols)

cov_table.start = pd.to_numeric(cov_table['start'])
cov_table.stop = pd.to_numeric(cov_table['stop'])

wig_file.i_start = pd.to_numeric(wig_file['i_start'])
wig_file.i_stop = pd.to_numeric(wig_file['i_stop'])



conn = sqlite3.connect(':memory:')

cov_table.to_sql('cov_table', conn, index=False)
wig_file.to_sql('wig', conn, index=False)

qry = '''
    select  
        start PresTermStart,
        stop PresTermEnd,
        cov RightCov,
        i_start pos,
        strand Strand
    from
        cov_table join wig on
        i_start between start and stop and 
        cov_table.strand = wig.i_strand
     '''

test = pd.read_sql_query(qry, conn)

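Whatever I change in the code, I always find small differences in the output (test). In this example, the Python result table is missing two rows, where the value should fall within the range and is equal to the end of the range:

Missing rows: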

> 19   24  141.306318     24      +
> 
> 19   24  122.923700     24      -

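Finally, I am worried that even if I find the right way to do it with sqlite3, the difference in computation time compared with data.table::foverlaps will be too large.

To summarize:

My first question is, of course, where does my code go wrong?
Is there an approach that is better suited and closer to foverlaps in terms of computation speed?

Thanks for reading, I hope this post is appropriate for SO.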

Answer 1

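In essence, the merge and interval logic differ between your R code and your Python code.

R

According to the foverlaps documentation, you are using the default type, any, which operates under the following conditions:

Let [a,b] and [c,d] be intervals in x and y with a<=b and c<=d.
...
For type="any", as long as c<=b and d>=a, they overlap.

In addition, you also join on the other key columns. Altogether, you are imposing the following logic (translated to the SQLite columns for comparison):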

foverlaps(to_match, interv) --> foverlaps(cov_table, wig)

wig.i_start <= cov_table.stop (ie, c <= b)
wig.i_stop >= cov_table.start (ie, d >= a)
wig.color == cov_table.color
wig.strand == cov_table.strand

Python

You are running an INNER JOIN + interval query that imposes the following logic:

wig.i_start >= cov_table.start (ie, i_start between start and stop)
wig.i_start <= cov_table.stop (ie, i_start between start and stop)
wig.strand == cov_table.strand

Notable differences in Python compared to R: wig.i_stop is never used; the chrm (color) column is never used; and wig.i_start is conditioned twice.
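The two conditions can be compared directly in plain Python. This is a minimal sketch, not part of the original answer: the function names and the sample intervals are made up for illustration, and only the interval part of the join is modeled (the color and strand keys are left out).

```python
def overlaps_any(a, b, c, d):
    """foverlaps type="any" on closed intervals [a, b] and [c, d]."""
    return c <= b and d >= a


def start_between(a, b, c):
    """The original query's test: only the start c of the other interval is checked."""
    return a <= c <= b


# Hypothetical intervals: [start, stop] from cov_table, [i_start, i_stop] from wig.
cov_interval = (19, 24)
wig_interval = (15, 20)

any_hit = overlaps_any(*cov_interval, *wig_interval)
between_hit = start_between(*cov_interval, wig_interval[0])

print(any_hit, between_hit)  # True False
```

The intervals overlap on 19..20, so the type="any" rule reports a match, while the BETWEEN rule misses it because i_start lies before start.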

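To resolve this, consider the following untested SQL adjustment, which will hopefully reproduce the R result. By the way, it is best practice in SQL to qualify every column in the JOIN clause (and even in SELECT) with its table name or alias: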

select
   cov_table.start as PresTermStart,
   cov_table.stop as PresTermEnd,
   cov_table.cov as RightCov,
   wig.i_start as pos,
   wig.i_strand as Strand
from
   cov_table
join wig
    on cov_table.chrm = wig.chrm
   and cov_table.strand = wig.i_strand
   and wig.i_start <= cov_table.stop
   and wig.i_stop  >= cov_table.start
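As a rough check, the adjusted join can be run against a tiny in-memory database using only the standard library. This is a sketch under assumptions: the column names follow the tables built in the question (chrm, i_strand), and the rows are made up for illustration rather than taken from the real files.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table cov_table (start integer, stop integer, chrm text, strand text, cov real);
    create table wig (i_start integer, i_stop integer, chrm text, i_strand text);
""")

# Made-up rows: only the second cov_table row should match the first wig row.
conn.executemany("insert into cov_table values (?, ?, ?, ?, ?)", [
    (15, 19, "blue", "+", 50.0),
    (19, 24, "blue", "+", 141.3),
    (19, 24, "red", "+", 99.9),
])
conn.executemany("insert into wig values (?, ?, ?, ?)", [
    (24, 24, "blue", "+"),
    (24, 24, "blue", "-"),
])

qry = """
    select cov_table.start as PresTermStart,
           cov_table.stop  as PresTermEnd,
           cov_table.cov   as RightCov,
           wig.i_start     as pos,
           wig.i_strand    as Strand
    from cov_table
    join wig
        on cov_table.chrm = wig.chrm
       and cov_table.strand = wig.i_strand
       and wig.i_start <= cov_table.stop
       and wig.i_stop >= cov_table.start
"""
rows = conn.execute(qry).fetchall()
print(rows)  # [(19, 24, 141.3, 24, '+')]
```

Position 24 equals the end of the interval [19, 24] and is still returned, because both comparisons are inclusive.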

For better performance, consider using a persistent (not in-memory) SQLite database and creating indexes on the join fields: color, strand, start, and stop.
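A minimal sketch of that setup, assuming the table layout from the question. The database path and the index names are hypothetical, and whether the indexes actually help should be measured on the real data.

```python
import os
import sqlite3
import tempfile

# Hypothetical on-disk location; any writable path works.
db_path = os.path.join(tempfile.mkdtemp(), "overlaps.db")
conn = sqlite3.connect(db_path)

conn.executescript("""
    create table cov_table (start integer, stop integer, chrm text, strand text, cov real);
    create table wig (i_start integer, i_stop integer, chrm text, i_strand text);

    -- Equality columns first, range columns last.
    create index idx_cov on cov_table (chrm, strand, start, stop);
    create index idx_wig on wig (chrm, i_strand, i_start, i_stop);
""")
conn.commit()

# List the indexes that now exist in the database file.
index_names = sorted(
    row[0] for row in conn.execute("select name from sqlite_master where type = 'index'")
)
print(index_names)  # ['idx_cov', 'idx_wig']
conn.close()
```

With pandas, the tables could instead be loaded through to_sql on this connection, as in the question, and the indexes created afterwards.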

Answer 2

To do interval overlaps in Python, simply use pyranges:

import pyranges as pr

c1 = """Chromosome Start End Gene
1 10 20 blo
1 45 46 bla"""

c2 = """Chromosome Start End Gene
1 10 35 bip
1 25 50 P53
1 40 10000 boop"""


gr1, gr2 = pr.from_string(c1), pr.from_string(c2)

j = gr1.join(gr2)
# +--------------+-----------+-----------+------------+-----------+-----------+------------+
# |   Chromosome |     Start |       End | Gene       |   Start_b |     End_b | Gene_b     |
# |   (category) |   (int32) |   (int32) | (object)   |   (int32) |   (int32) | (object)   |
# |--------------+-----------+-----------+------------+-----------+-----------+------------|
# |            1 |        10 |        20 | blo        |        10 |        35 | bip        |
# |            1 |        45 |        46 | bla        |        25 |        50 | P53        |
# |            1 |        45 |        46 | bla        |        40 |     10000 | boop       |
# +--------------+-----------+-----------+------------+-----------+-----------+------------+
# Unstranded PyRanges object has 3 rows and 7 columns from 1 chromosomes.
# For printing, the PyRanges was sorted on Chromosome.
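One thing to check when carrying the question's data over: foverlaps treats intervals as closed, [start, stop], whereas BED-style tools usually treat them as half-open, [Start, End). Assuming pyranges follows the half-open convention (worth confirming in its documentation), a point such as start = stop = 24 would need End = stop + 1. The helpers below are hypothetical and written in plain Python only to show the difference.

```python
def closed_to_half_open(start, stop):
    """Convert a closed interval [start, stop] to half-open [start, stop + 1)."""
    return start, stop + 1


def half_open_overlap(s1, e1, s2, e2):
    """Overlap test for half-open intervals [s1, e1) and [s2, e2)."""
    return s1 < e2 and s2 < e1


# Closed interval [19, 24] and the single position 24, as in the missing rows.
interval = closed_to_half_open(19, 24)  # (19, 25)
point = closed_to_half_open(24, 24)     # (24, 25)

hit = half_open_overlap(*interval, *point)
naive_hit = half_open_overlap(19, 24, 24, 24)  # coordinates passed unconverted

print(hit, naive_hit)  # True False
```

Without the conversion, a position equal to the end of a range is dropped, which is the same kind of boundary row the question reports as missing.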

2 solutions

Solution 1
1 2019-05-22 18:17:17

Solution 2
1 Accepted 2020-04-22 08:44:15
