R group by overlapping ranges

Question

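I have a data frame in which each row contains a range. I want to identify groups of ranges in which every range overlaps at least one other row in the group by more than 75%. The grouping should be added to the original data as an index variable.

Sample data: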

df <- data.frame(label = c("A", "B", "C", "D", "E", "F"),
                 start = c(16, 18, 37, 62, 15, 45),
                 stop = c(22, 24, 55, 66, 23, 55))

The resulting df should look like this:

label       start   stop   ID
"A"         16      22     1
"B"         18      24     1
"C"         37      55     2
"D"         62      66     3
"E"         15      23     1
"F"         45      55     2

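First I tried a dplyr approach with mutate and lag, but then the grouping depends on the order of the rows and does not work in all cases. Next I tried a for loop with seq_along, but I could not get it to work. Hopefully one of you can...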

Answer 1

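# Overlap of two ranges given as c(start, stop): the shared length divided by
# each range's own length, keeping the larger ratio. This scalar helper is not
# called below; the within() step computes the same quantity in vectorized form.
overlap <- function(A, B) {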
  shared <- pmax(0, min(A[2], B[2]) - max(A[1], B[1]))
  max(shared / c(diff(A), diff(B)))
}
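
As a quick sanity check of the helper above, the sketch below applies it to three pairs of ranges taken from the sample data. The variables `A`, `B` and `E` are introduced here only for illustration, and the helper is repeated so the snippet runs on its own.

```r
# Shared length divided by each range's own length; keep the larger ratio.
overlap <- function(A, B) {
  shared <- pmax(0, min(A[2], B[2]) - max(A[1], B[1]))
  max(shared / c(diff(A), diff(B)))
}

A <- c(16, 22)
B <- c(18, 24)
E <- c(15, 23)

overlap(A, E)  # A lies entirely inside E: 6 / 6 = 1
overlap(B, E)  # shared length 5 over B's length 6, about 0.83
overlap(A, B)  # shared length 4 over length 6, about 0.67, below 0.75
```

So A and B are not linked to each other directly, but each is linked to E, which is why all three end up in the same group.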

# Enumerate every unordered pair of rows (a < b), so each pair is tested once.
eg <- expand.grid(a = seq_len(nrow(df)), b = seq_len(nrow(df)))
eg <- eg[eg$a < eg$b,]

together <- cbind(
  setNames(df[eg$a,], paste0(names(df), "1")),
  setNames(df[eg$b,], paste0(names(df), "2"))
)
together <- within(together, {
  shared  = pmax(0, pmin(stop1, stop2) - pmax(start1, start2))
  overlap = pmax(shared / (stop1 - start1), shared / (stop2 - start2))
})[, c("label1", "label2", "overlap")]

# Keep pairs that overlap by at least 75% (use > for a strictly-greater test).
bigenough <- together[together$overlap >= 0.75,]
groups <- split(bigenough$label2, bigenough$label1)
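
At this point `groups` holds only the direct pairwise links, which is why a merging step is still needed afterwards. The sketch below is a condensed restatement of the steps so far (it indexes `df` directly instead of building `together`, and the names `shared`, `keep` and `direct` are introduced here) to show what those links look like for the sample data. `stringsAsFactors = FALSE` is set explicitly as an assumption, so the result is the same on R versions before 4.0.

```r
df <- data.frame(label = c("A", "B", "C", "D", "E", "F"),
                 start = c(16, 18, 37, 62, 15, 45),
                 stop  = c(22, 24, 55, 66, 23, 55),
                 stringsAsFactors = FALSE)

# Every unordered pair of row indices.
eg <- expand.grid(a = seq_len(nrow(df)), b = seq_len(nrow(df)))
eg <- eg[eg$a < eg$b, ]

# Vectorized overlap fraction for each pair.
shared <- pmax(0, pmin(df$stop[eg$a], df$stop[eg$b]) -
                  pmax(df$start[eg$a], df$start[eg$b]))
ovl <- pmax(shared / (df$stop[eg$a] - df$start[eg$a]),
            shared / (df$stop[eg$b] - df$start[eg$b]))

# Direct links only: first label of each qualifying pair -> second label.
keep   <- ovl >= 0.75
direct <- split(df$label[eg$b][keep], df$label[eg$a][keep])
direct
# $A
# [1] "E"
#
# $B
# [1] "E"
#
# $C
# [1] "F"
```

A and B each point to E but not to each other, and D has no entry at all, so the list still has to be merged transitively and padded with singletons.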

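# Merge transitively: for each label, fuse every group that is named after it
# or contains it into one group; labels with no partner become singletons.
for (ltr in df$label) {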
  ind <- (ltr == names(groups)) | sapply(groups, `%in%`, x = ltr)
  groups <- c(
    setNames(list(unique(c(ltr, names(groups[ind]), unlist(groups[ind])))), ltr),
    groups[!ind]
  )
}

groups <- data.frame(
  ID = rep(seq_along(groups), lengths(groups)),
  label = unlist(groups)
)

Result:

merge(df, groups, by = "label")
#   label start stop ID
# 1     A    16   22  2
# 2     B    18   24  2
# 3     C    37   55  1
# 4     D    62   66  3
# 5     E    15   23  2
# 6     F    45   55  1
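
The ID values here are arbitrary labels (they follow the order in which the loop left the groups), so they differ from the numbering shown in the question. If the numbering matters, one possible way to renumber by first appearance is `match()` against the unique IDs. This is a sketch that hard-codes the merged result above as `res`, a name introduced here.

```r
# The merged result from above, re-created by hand so the snippet runs alone.
res <- data.frame(
  label = c("A", "B", "C", "D", "E", "F"),
  start = c(16, 18, 37, 62, 15, 45),
  stop  = c(22, 24, 55, 66, 23, 55),
  ID    = c(2, 2, 1, 3, 2, 1)
)

# The first group encountered becomes 1, the next new group 2, and so on.
res$ID <- match(res$ID, unique(res$ID))
res$ID
# [1] 1 1 2 3 1 2
```

With that renumbering, the ID column matches the one requested in the question.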

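You asked for a way to do this without a for loop. Because each iteration has to work on the result of the previous iteration, lapply will not work for us. However, we can use Reduce: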

# groups <- split(...)

groups <- Reduce(function(grps, ltr) {
  ind <- (ltr == names(grps)) | sapply(grps, `%in%`, x = ltr)
  c(setNames(list(unique(c(ltr, names(grps[ind]), unlist(grps[ind])))), ltr),
    grps[!ind])
}, df$label, init = groups)
# $F
# [1] "F" "C"
# $E
# [1] "E" "B" "A"
# $D
# [1] "D"

# groups <- data.frame(ID = ...)
# merge(df, groups, ...)

(followed by the final groups <- data.frame(..) call from above). This works just as well. The only catch is that Reduce is itself implemented with a for loop ( https://github.com/wch/r-source/blob/d22ee2fc0dc8142b23eed9f46edf76ea9d3ca69a/src/library/base/R/funprog.ZE1E1D3D405-AF1283DEE6 )
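
To see why Reduce fits here where lapply does not, a toy case may help. The following sketch is not part of the original answer, and `running` and `merged` are made-up names. It shows Reduce handing the accumulated value from one step to the next, first with numbers and then with a small list used as state.

```r
# Each call receives the result of the previous call as its first argument.
running <- Reduce(function(acc, x) acc + x, 1:4, accumulate = TRUE)
running
# [1]  1  3  6 10

# Same idea with a list as the carried state: start from `init` and fold in
# one label at a time.
merged <- Reduce(
  function(grps, ltr) c(grps, setNames(list(toupper(ltr)), ltr)),
  c("a", "b"),
  init = list()
)
names(merged)
# [1] "a" "b"
```

lapply, by contrast, calls its function on each element independently, so a step has no access to what the earlier steps produced.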

Answer 1 (2, accepted, 2021-01-27 16:49:03)
