How can I merge two data frames by matching two numeric columns within a +/-5 range?

Question

I have two data frames as below:

df1 <- data.frame(chrom = c(1,1,3,6,6),
                  chromStart = c(15433, 1959,34205,35043, 77456),
                  chromEnd = c(15700, 2001,36245,36245,78469), 
                  id = c('aaad', 'dfk', 'bb', 'llk', 'ie9o'))

df2 <- data.frame(chrom = c(1,1,5,1,6),
                  chromStart2 = c(15433, 1961,34205,1962, 77456),
                  chromEnd2 = c(15700, 2002,36245,1999,78480))

I'd like to merge the two data frames by matching chrom == chrom, chromStart within between(chromStart2 - 5, chromStart2 + 5), and chromEnd within between(chromEnd2 - 5, chromEnd2 + 5). What I've tried is:

library(dplyr)
colnames(df2) <- c('chrom','chromStart', 'chromEnd')
merged <- inner_join(df1,df2)

However, that only matches exact chromStart and chromEnd values, so in our case only aaad matches. I'd like to allow a range of plus or minus 5 so that dfk matches as well. My actual data frames have 260000 rows and 179000 rows, so I would prefer a memory-efficient way if possible. Here are the results I'm looking for:

data.frame(chrom = c(1,1,1),
           chromStart = c(15433, 1959,1959),
           chromEnd = c(15700, 2001,2001), 
           id = c('aaad', 'dfk', 'dfk'),
           chromStart2 = c(15433, 1961,1962),
           chromEnd2 = c(15700, 2002,1999))

Answer 1

There may be better or more efficient ways, but these should work.

A dplyr approach: join on chrom, create two temporary logical vectors based on your conditions, filter to the rows meeting both conditions, then drop (select) the temporary columns:

library(dplyr)

# Note: df2 must still have its original column names (chromStart2, chromEnd2),
# i.e. skip the colnames(df2) <- ... step from the question.
merged <- inner_join(df1, df2, by = "chrom") %>%
  mutate(
    inStart = chromStart >= chromStart2 - 5 & chromStart <= chromStart2 + 5,
    inEnd = chromEnd >= chromEnd2 - 5 & chromEnd <= chromEnd2 + 5) %>%
  filter(inStart, inEnd) %>%
  select(-inStart, -inEnd)

### or in one `mutate` command:
# merged <- inner_join(df1, df2, by = "chrom") %>%
#   mutate(inrows = (chromStart >= chromStart2 - 5 & chromStart <= chromStart2 + 5) &
#       (chromEnd >= chromEnd2 - 5 & chromEnd <= chromEnd2 + 5)) %>%
#   filter(inrows) %>%
#   select(-inrows)

Output:

#   chrom chromStart chromEnd   id chromStart2 chromEnd2
# 1     1      15433    15700 aaad       15433     15700
# 2     1       1959     2001  dfk        1961      2002
# 3     1       1959     2001  dfk        1962      1999

And check that it matches the desired result exactly:

all.equal(merged,
          data.frame(chrom = c(1,1,1),
           chromStart = c(15433, 1959,1959),
           chromEnd = c(15700, 2001,2001), 
           id = c('aaad', 'dfk', 'dfk'),
           chromStart2 = c(15433, 1961,1962),
           chromEnd2 = c(15700, 2002,1999))
)
# [1] TRUE
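
The pipeline above first builds every same-chrom pair and only then filters, which may get large with 260000 and 179000 rows. As a hedged sketch, assuming dplyr 1.1.0 or later is installed, the range conditions can instead be placed in the join itself with join_by() and its between() helper. The names tol, df2_win, startLo, startHi, endLo, endHi and merged_rng are made up for this illustration.

```r
library(dplyr)  # assumption: dplyr >= 1.1.0, which provides join_by()

df1 <- data.frame(chrom = c(1,1,3,6,6),
                  chromStart = c(15433, 1959,34205,35043, 77456),
                  chromEnd = c(15700, 2001,36245,36245,78469),
                  id = c('aaad', 'dfk', 'bb', 'llk', 'ie9o'))

df2 <- data.frame(chrom = c(1,1,5,1,6),
                  chromStart2 = c(15433, 1961,34205,1962, 77456),
                  chromEnd2 = c(15700, 2002,36245,1999,78480))

tol <- 5  # hypothetical tolerance variable

# Precompute the lower/upper bounds around df2's coordinates
df2_win <- df2 %>%
  mutate(startLo = chromStart2 - tol, startHi = chromStart2 + tol,
         endLo = chromEnd2 - tol, endHi = chromEnd2 + tol)

# Equality on chrom, range match on both coordinates (bounds inclusive)
merged_rng <- inner_join(
  df1, df2_win,
  by = join_by(chrom,
               between(chromStart, startLo, startHi),
               between(chromEnd, endLo, endHi))
) %>%
  select(chrom, chromStart, chromEnd, id, chromStart2, chromEnd2)
```

On the example data this should keep the same three rows as the filter-based version (aaad once, dfk twice). Whether it actually uses less memory on the full data is something to measure rather than assume.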

A base R approach: merge on chrom, then subset to the rows that meet the same conditions:

base1 <- merge(df1, df2, by = "chrom")

# Keep rows where both coordinates of df1 fall within +/- 5 of df2's
base_merged <- base1[(base1$chromStart >= base1$chromStart2 - 5 & base1$chromStart <= base1$chromStart2 + 5) &
        (base1$chromEnd >= base1$chromEnd2 - 5 & base1$chromEnd <= base1$chromEnd2 + 5),]
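
Since merge(df1, df2, by = "chrom") materialises every same-chrom pair before subsetting, a non-equi join may be worth trying for the 260000 x 179000 case. The following is only a sketch using the data.table package, which the original answer does not use. The names dt1, dt2, startLo, startHi, endLo, endHi and res are invented for the illustration.

```r
library(data.table)  # assumption: data.table is installed (nomatch = NULL needs >= 1.12.0)

df1 <- data.frame(chrom = c(1,1,3,6,6),
                  chromStart = c(15433, 1959,34205,35043, 77456),
                  chromEnd = c(15700, 2001,36245,36245,78469),
                  id = c('aaad', 'dfk', 'bb', 'llk', 'ie9o'))

df2 <- data.frame(chrom = c(1,1,5,1,6),
                  chromStart2 = c(15433, 1961,34205,1962, 77456),
                  chromEnd2 = c(15700, 2002,36245,1999,78480))

dt1 <- as.data.table(df1)
dt2 <- as.data.table(df2)

# Precompute the +/- 5 windows around df2's coordinates
dt2[, `:=`(startLo = chromStart2 - 5, startHi = chromStart2 + 5,
           endLo = chromEnd2 - 5, endHi = chromEnd2 + 5)]

# Non-equi join: in `on`, the left name is a dt2 column, the right a dt1 column
res <- dt2[dt1,
           on = .(chrom,
                  startLo <= chromStart, startHi >= chromStart,
                  endLo <= chromEnd, endHi >= chromEnd),
           nomatch = NULL,  # drop dt1 rows without a match (inner join)
           .(chrom,
             chromStart = i.chromStart, chromEnd = i.chromEnd, id,
             chromStart2 = x.chromStart2, chromEnd2 = x.chromEnd2)]
```

The i. and x. prefixes are used in the output list because non-equi joins otherwise report the join columns under dt2's names with dt1's values, which is easy to misread.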
