R: Comparing rows of a data.frame (to combine them based on conditions)

Question

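I start with a data.frame of genomic ranges (a chromosome plus a start and end position). I am trying to merge rows that 1) lie adjacent to each other and 2) share a value in two other columns. Note: I would like an efficient approach, because my real data has more than 10 million rows. (Please use data.table if possible.)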

Toy data:

DF <- data.frame(SampleID = c(1,1,1,1,1,2,2),
                 Chr = c(1,1,1,1,2,1,1),
                 Start = c(1, 101, 201, 401, 500, 1, 101),
                 End = c(100, 200, 300, 499, 599, 100, 200),
                 State = c(3,3,2,3,3,2,2)
                 )
DF
   SampleID Chr Start End State
1:        1   1     1 100     3
2:        1   1   101 200     3
3:        1   1   201 300     2
4:        1   1   401 499     3
5:        1   2   500 599     3
6:        2   1     1 100     2
7:        2   1   101 200     2

Rows 1 and 2 can be merged because they are adjacent (1-100 and 101-200) and share both SampleID (1) and State (3).

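The following cannot be merged:

Rows 2 and 3: the States do not match
Rows 3 and 4: they are not adjacent and do not share a State
Rows 4 and 5: they are on different chromosomes (Chr)
Rows 5 and 6: they have different SampleIDs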

And so on. Applying all of these rules gives the final table:

FinalDF <- data.frame(SampleID = c(1,1,1,1,2),
                      Chr = c(1,1,1,2,1),
                      Start = c(1,201,401,500,1),
                      End = c(200,300,499,599,200),
                      State = c(3,2,3,3,2))
FinalDF
  SampleID Chr Start End State
1        1   1     1 200     3
2        1   1   201 300     2
3        1   1   401 499     3
4        1   2   500 599     3
5        2   1     1 200     2
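The merge rule described above can be written as a vectorized test on consecutive rows. The sketch below uses base R only, assumes the rows are already sorted by SampleID, Chr and Start, and introduces a hypothetical helper name, `can_merge_next`. It only flags mergeable pairs; it does not perform the merge.

```r
DF <- data.frame(SampleID = c(1,1,1,1,1,2,2),
                 Chr      = c(1,1,1,1,2,1,1),
                 Start    = c(1, 101, 201, 401, 500, 1, 101),
                 End      = c(100, 200, 300, 499, 599, 100, 200),
                 State    = c(3,3,2,3,3,2,2))

# Hypothetical helper: element i is TRUE when row i and row i + 1 can be merged.
can_merge_next <- function(df) {
  n <- nrow(df)
  if (n < 2) return(logical(0))
  a <- df[-n, ]   # rows 1 .. n-1
  b <- df[-1, ]   # rows 2 .. n
  a$SampleID == b$SampleID &   # same sample
    a$Chr == b$Chr &           # same chromosome
    a$State == b$State &       # same state
    a$End + 1 == b$Start       # b starts right after a ends
}

can_merge_next(DF)
```

For the toy data only the pairs (1, 2) and (6, 7) are flagged, which matches the cases listed in the question.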

So far I have tried the reduce function from the GenomicRanges package, but it does not give the result I want.

Incorrect output:

reduce(DF2)
GRanges object with 3 ranges and 0 metadata columns:
      seqnames     ranges strand
         <Rle>  <IRanges>  <Rle>
  [1]        1 [  1, 300]      *
  [2]        1 [401, 499]      *
  [3]        2 [500, 501]      *
  -------
  seqinfo: 2 sequences from an unspecified genome; no seqlengths

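I have tried to do this with data.table, because my data.frames have 10 million rows or more, but I have not been able to figure it out.

The following question is the same (perhaps more complex) but has no solution: R - collapse rows based on the contents of two columns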

Answer 1

library(data.table)

dt = as.data.table(DF) # or convert in place using setDT

dt[, .(Start = min(Start), End = max(End), State = State[1])
   , by = .(SampleID, Chr, rleid(State),
            cumsum(c(FALSE, head(End + 1, -1) < tail(Start, -1))))]
#   SampleID Chr rleid cumsum Start End State
#1:        1   1     1      0     1 200     3
#2:        1   1     2      0   201 300     2
#3:        1   1     3      1   401 499     3
#4:        1   2     3      1   500 599     3
#5:        2   1     4      1     1 200     2
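To see why this grouping works, it can help to materialise the two helper keys as columns first. The sketch below is an illustration rather than part of the answer: it assumes the table is sorted by SampleID, Chr and Start, and the column names `state_run` and `gap_id` are made up here.

```r
library(data.table)

dt <- data.table(SampleID = c(1,1,1,1,1,2,2),
                 Chr      = c(1,1,1,1,2,1,1),
                 Start    = c(1, 101, 201, 401, 500, 1, 101),
                 End      = c(100, 200, 300, 499, 599, 100, 200),
                 State    = c(3,3,2,3,3,2,2))

# state_run: increases whenever State changes from one row to the next.
dt[, state_run := rleid(State)]

# gap_id: increases whenever a row starts later than the previous row's End + 1.
dt[, gap_id := cumsum(c(FALSE, head(End + 1, -1) < tail(Start, -1)))]

# Rows are merged only if they agree on all four keys.
merged <- dt[, .(Start = min(Start), End = max(End), State = State[1]),
             by = .(SampleID, Chr, state_run, gap_id)]

# Drop the helper keys once the merge is done.
merged[, c("state_run", "gap_id") := NULL]
merged
```

Both helper keys are computed over the whole table rather than per group. That is harmless here because SampleID and Chr are grouping keys as well, so rows from different samples or chromosomes never end up in the same group.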

Answer 2

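If I have interpreted what you want to do correctly, I would suggest the following: use dplyr to group by the metadata you want to keep separate, then use GenomicRanges to work out the ranges within each group. (If you run into performance problems, you may want to move away from GenomicRanges, which requires a data.frame here, and implement the merging by hand to take advantage of dplyr's performance with data.tables.) Here is an example of how this works; the pipe %>% makes it easier to follow what is happening: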

DF <- data.frame(SampleID = c(1,1,1,1,1,2,2),
                 Chr = c(1,1,1,1,2,1,1),
                 Start = c(1, 101, 201, 401, 500, 1, 101),
                 End = c(100, 200, 300, 499, 599, 100, 200),
                 State = c(3,3,2,3,3,2,2)
)

library(dplyr)
# take your data frame
DF %>% 
  # group it by the subsets
  group_by(SampleID, Chr, State) %>% 
  # operate on each group
  do(
    # turn subset into a GRanges object
    as(as.data.frame(.), "GRanges") %>%
      # reduce ranges
      GenomicRanges::reduce() %>% 
      # turn back into data frame for dplyr to stitch together
      as.data.frame() %>% 
      # get the information you want
      select(start, end, width)
  ) %>% 
  # ungroup for future operations
  ungroup() %>% 
  # sort by what makes most sense for your set
  arrange(SampleID, Chr, start)

Output:

Source: local data frame [5 x 6]

SampleID Chr State start end width
(dbl) (dbl) (dbl) (int) (int) (int)
 1     1     3     1   200   200
 1     1     2   201   300   100
 1     1     3   401   499    99
 1     2     3   500   599   100
 2     1     2     1   200   200
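The "implement it by hand" option mentioned above could look roughly like the following dplyr-only sketch. It is not the answerer's code: the helper columns `prev_end` and `block` are illustrative names, and it assumes closed integer intervals, so that `End + 1 == Start` means two ranges are adjacent.

```r
library(dplyr)

DF <- data.frame(SampleID = c(1,1,1,1,1,2,2),
                 Chr      = c(1,1,1,1,2,1,1),
                 Start    = c(1, 101, 201, 401, 500, 1, 101),
                 End      = c(100, 200, 300, 499, 599, 100, 200),
                 State    = c(3,3,2,3,3,2,2))

merged <- DF %>%
  arrange(SampleID, Chr, Start) %>%
  group_by(SampleID, Chr, State) %>%
  # prev_end: largest End seen so far in the group, shifted down by one row
  # block: starts a new block at the first row of a group or after a gap
  mutate(prev_end = lag(cummax(End)),
         block = cumsum(is.na(prev_end) | Start > prev_end + 1)) %>%
  group_by(SampleID, Chr, State, block) %>%
  summarise(Start = min(Start), End = max(End), .groups = "drop") %>%
  arrange(SampleID, Chr, Start) %>%
  select(SampleID, Chr, Start, End, State)

merged
```

Like the GenomicRanges version, this merges overlapping ranges as well as adjacent ones within each SampleID, Chr and State group.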

Answer 3

# This code is kind of robust but it appears to get the job done

DF <- data.frame(SampleID = c(1,1,1,1,1,2,2),
                 Chr = c(1,1,1,1,2,1,1),
                 Start = c(1, 101, 201, 401, 500, 1, 101),
                 End = c(100, 200, 300, 499, 599, 100, 200),
                 State = c(3,3,2,3,3,2,2)
)

test_and_combine <- function(r1,r2) {
  if (r1[,1] == r2[,1] && # check if "SampleID" column matches
      r1[,2] == r2[,2] && # check if "Chr" column matches
      (r1[,4] + 1) == r2[,3] && # check that r2 starts right after r1 ends
      r1[,5] == r2[,5]) # check if "State" column matches
    {
    # merge rows if true
    DF_comb <- r1[,]
    DF_comb[1,4] <- r2[,4]

  }
  else{
    DF_comb <- NA 
  }
  return(DF_comb)
}

# This section could be rewritten to use Reduce()
DF_comb_final <- data.frame()
for(i in 1:(nrow(DF)-1)){ # loop through every row of the data.frame
  DF_temp <- test_and_combine(DF[i,],DF[i+1,]) # send two rows to function
  if(!any(is.na(DF_temp))){
    DF_comb_final <- rbind(DF_comb_final,DF_temp)    
  }
}
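As written, the loop above appends only the pairs that were merged, so rows that merge with nothing (rows 3, 4 and 5 of the toy data) do not appear in DF_comb_final, and a run of three or more mergeable rows would produce overlapping pairs. One possible way to handle both cases, sketched here in base R, is to carry a running row forward and extend it while the next row qualifies. The names `merge_runs` and `current` are hypothetical, and the function assumes the input is sorted by SampleID, Chr and Start.

```r
DF <- data.frame(SampleID = c(1,1,1,1,1,2,2),
                 Chr      = c(1,1,1,1,2,1,1),
                 Start    = c(1, 101, 201, 401, 500, 1, 101),
                 End      = c(100, 200, 300, 499, 599, 100, 200),
                 State    = c(3,3,2,3,3,2,2))

# Hypothetical helper: collapse runs of mergeable rows into single rows.
merge_runs <- function(df) {
  if (nrow(df) == 0) return(df)
  out <- vector("list", nrow(df))  # at most one output row per input row
  k <- 0
  current <- df[1, ]
  for (i in seq_len(nrow(df))[-1]) {
    nxt <- df[i, ]
    mergeable <- current$SampleID == nxt$SampleID &&
      current$Chr == nxt$Chr &&
      current$State == nxt$State &&
      current$End + 1 == nxt$Start
    if (mergeable) {
      current$End <- nxt$End   # extend the running row
    } else {
      k <- k + 1
      out[[k]] <- current      # close the running row and start a new one
      current <- nxt
    }
  }
  k <- k + 1
  out[[k]] <- current          # do not forget the last running row
  res <- do.call(rbind, out[seq_len(k)])
  rownames(res) <- NULL
  res
}

merge_runs(DF)
```

A row-by-row loop like this is likely to be slow on 10 million rows, so it is meant to illustrate the logic rather than to replace a vectorized or data.table approach.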

3 answers

Answer 1: 4, accepted, 2016-04-14 22:10:41
Answer 2: 2, 2016-04-14 21:49:50
Answer 3: 1, 2016-04-14 21:51:21
