How to merge two data frames by matching two numeric columns with a +-5 range?

I have two data frames as follows:

df1 <- data.frame(chrom = c(1,1,3,6,6),
                  chromStart = c(15433, 1959,34205,35043, 77456),
                  chromEnd = c(15700, 2001,36245,36245,78469), 
                  id = c('aaad', 'dfk', 'bb', 'llk', 'ie9o'))

df2 <- data.frame(chrom = c(1,1,5,1,6),
                  chromStart2 = c(15433, 1961,34205,1962, 77456),
                  chromEnd2 = c(15700, 2002,36245,1999,78480))

I want to merge the two data frames by matching chrom == chrom, chromStart = between(chromStart2 - 5, chromStart2 + 5), and chromEnd = between(chromEnd2 - 5, chromEnd2 + 5). What I have tried is:

library(dplyr)
colnames(df2) <- c('chrom','chromStart', 'chromEnd')
merged <- inner_join(df1,df2)

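However, this only matches exact chromStart and chromEnd values, so in our example only aaad matches. I want to allow a plus-or-minus range so that dfk matches as well. My actual data frames have 260,000 and 179,000 rows, so I would prefer a memory-efficient method if possible. The following is the result I am looking for: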

data.frame(chrom = c(1,1,1),
           chromStart = c(15433, 1959,1959),
           chromEnd = c(15700, 2001,2001), 
           id = c('aaad', 'dfk', 'dfk'),
           chromStart2 = c(15433, 1961,1962),
           chromEnd2 = c(15700, 2002,1999))

There may be better or more efficient approaches, but these should work.

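dplyr approach: join on chrom, create two temporary logical vectors based on your conditions, filter for the rows that satisfy both conditions, and then remove (select) the temporary columns: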

# Uses df2 with its original column names (chromStart2, chromEnd2)
merged <- inner_join(df1, df2, by = "chrom") %>%
  mutate(
    inStart = chromStart >= chromStart2 - 5 & chromStart <= chromStart2 + 5,
    inEnd = chromEnd >= chromEnd2 - 5 & chromEnd <= chromEnd2 + 5) %>%
  filter(inStart, inEnd) %>%
  select(-inStart, -inEnd)

### or in one `mutate` command:
# merged <- inner_join(df1, df2, by = "chrom") %>%
#   mutate(inrows = (chromStart >= chromStart2 - 5 & chromStart <= chromStart2 + 5) &
#       (chromEnd >= chromEnd2 - 5 & chromEnd <= chromEnd2 + 5)) %>%
#   filter(inrows) %>%
#   select(-inrows)

Output:

#   chrom chromStart chromEnd   id chromStart2 chromEnd2
# 1     1      15433    15700 aaad       15433     15700
# 2     1       1959     2001  dfk        1961      2002
# 3     1       1959     2001  dfk        1962      1999

And check to make sure it exactly matches the desired final data:

all.equal(merged,
          data.frame(chrom = c(1,1,1),
           chromStart = c(15433, 1959,1959),
           chromEnd = c(15700, 2001,2001), 
           id = c('aaad', 'dfk', 'dfk'),
           chromStart2 = c(15433, 1961,1962),
           chromEnd2 = c(15700, 2002,1999))
)
# [1] TRUE
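
Since the question asks for a memory-efficient method, one option worth considering is a non-equi join, which avoids materializing every same-chromosome pair before filtering. The following is only a sketch: it assumes dplyr 1.1.0 or later (where `join_by()` and its `between()` helper are available), and the window columns `startLo`, `startHi`, `endLo`, and `endHi` are hypothetical names introduced here for illustration.

```r
library(dplyr)  # assumes dplyr >= 1.1.0 for join_by()

df1 <- data.frame(chrom = c(1, 1, 3, 6, 6),
                  chromStart = c(15433, 1959, 34205, 35043, 77456),
                  chromEnd = c(15700, 2001, 36245, 36245, 78469),
                  id = c('aaad', 'dfk', 'bb', 'llk', 'ie9o'))

df2 <- data.frame(chrom = c(1, 1, 5, 1, 6),
                  chromStart2 = c(15433, 1961, 34205, 1962, 77456),
                  chromEnd2 = c(15700, 2002, 36245, 1999, 78480))

# Precompute the +-5 window around each df2 interval.
# startLo/startHi/endLo/endHi are hypothetical helper columns.
df2_win <- df2 %>%
  mutate(startLo = chromStart2 - 5, startHi = chromStart2 + 5,
         endLo = chromEnd2 - 5, endHi = chromEnd2 + 5)

# Match on equal chrom, with chromStart and chromEnd each falling
# inside the corresponding window; then drop the helper columns.
merged_nonequi <- inner_join(
  df1, df2_win,
  by = join_by(chrom,
               between(chromStart, startLo, startHi),
               between(chromEnd, endLo, endHi))
) %>%
  select(-startLo, -startHi, -endLo, -endHi)
```

On the example data this should return the same three rows as the filter-based version; how much memory it saves on the full 260,000 and 179,000 row tables has not been measured here.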

base R approach: merge on chrom, then subset the data by identifying the rows that satisfy the same conditions:

base1 <- merge(df1, df2, by = "chrom")

base_merged <- base1[(base1$chromStart >= base1$chromStart2 - 5 & base1$chromStart <= base1$chromStart2 + 5) &
        (base1$chromEnd >= base1$chromEnd2 - 5 & base1$chromEnd <= base1$chromEnd2 + 5),]
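
Merging on chrom alone builds every same-chromosome pair at once, which may be heavy for the row counts mentioned in the question. As a rough base R sketch, the same filter can be applied one chromosome at a time so that only one chromosome's pairs are held in memory at any moment. The names `tol`, `idx1`, `idx2`, `pieces`, and `chunked` are made up for this illustration, and the condition is written as `abs(x - y) <= tol`, which is equivalent to the two-sided range check above.

```r
df1 <- data.frame(chrom = c(1, 1, 3, 6, 6),
                  chromStart = c(15433, 1959, 34205, 35043, 77456),
                  chromEnd = c(15700, 2001, 36245, 36245, 78469),
                  id = c('aaad', 'dfk', 'bb', 'llk', 'ie9o'))

df2 <- data.frame(chrom = c(1, 1, 5, 1, 6),
                  chromStart2 = c(15433, 1961, 34205, 1962, 77456),
                  chromEnd2 = c(15700, 2002, 36245, 1999, 78480))

tol <- 5  # the +-5 tolerance from the question

# Group row indices by chromosome; only shared chromosomes can match.
idx1 <- split(seq_len(nrow(df1)), df1$chrom)
idx2 <- split(seq_len(nrow(df2)), df2$chrom)
shared <- intersect(names(idx1), names(idx2))

pieces <- lapply(shared, function(ch) {
  a <- df1[idx1[[ch]], , drop = FALSE]
  b <- df2[idx2[[ch]], , drop = FALSE]
  # All row-position pairs within this chromosome only
  pairs <- expand.grid(i = seq_len(nrow(a)), j = seq_len(nrow(b)))
  keep <- abs(a$chromStart[pairs$i] - b$chromStart2[pairs$j]) <= tol &
    abs(a$chromEnd[pairs$i] - b$chromEnd2[pairs$j]) <= tol
  # Keep only the matching pairs and bind the df2 coordinates alongside
  cbind(a[pairs$i[keep], , drop = FALSE],
        b[pairs$j[keep], c("chromStart2", "chromEnd2"), drop = FALSE])
})

chunked <- do.call(rbind, pieces)
rownames(chunked) <- NULL
```

This lowers peak memory compared with a single merge, but it still forms all pairs within each chromosome, so a chromosome with very many rows in both tables would remain expensive.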
