
Creating group ids by comparing values of two variables across rows: in R

I have a dataframe with two variables (start, end). I would like to create an identifier variable which grows in ascending order of start and, most importantly, is kept constant if the value of start coincides with the end of any other row in the dataframe.

Below is a simple example of the data:

toy_data <- data.frame(start = c(1,5,6,10,16),
                      end = c(10,9,11,15,17))

The output I would be looking for is the following:

output_data <- data.frame(start = c(1,10,5,6,16),
                   end = c(10,15,9,11,17),
                   NEW_VAR = c(1,1,2,3,4))

The following function should give you the desired identifier variable NEW_VAR.

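identifier <- \(df) {
  x <- array(0L, dim = nrow(df))
  # count tracks (as a negative number) how many rows so far had a
  # start value that also appears somewhere in end
  count <- 0L
  my_seq <- seq_len(nrow(df))
  for (i in my_seq) {
    if(!df[i,]$start %in% df$end) {
      # no match: the identifier moves on to the next value
      x[i] <- my_seq[i] + count
    } else {
      # match: the identifier keeps its previous value
      x[i] <- my_seq[i]-1L + count
      count <- count - 1L
    }
  }
  x
}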

Example

toy_data <- data.frame(start = c(1, 2, 2, 4, 16, 21, 18, 3),
                       end = c(16, 2, 21, 2, 2, 2, 3, 1))
toy_data$NEW_VAR <- identifier(toy_data)
# ---------------------
> toy_data$NEW_VAR
[1] 0 0 0 1 1 1 2 2
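
As a side note, the loop in identifier appears to reduce to a running count of the rows whose start does not occur anywhere in end. Below is a minimal sketch of that reading; identifier_vec is a hypothetical name used only for illustration, and the data is the example from the question.

```r
# Vectorized reading of the loop: cumulatively count the rows whose
# `start` is NOT found anywhere in `end`.
# (identifier_vec is a hypothetical name, not part of the answer above.)
identifier_vec <- function(df) {
  cumsum(!(df$start %in% df$end))
}

# The data from the question
toy_data <- data.frame(start = c(1, 5, 6, 10, 16),
                       end = c(10, 9, 11, 15, 17))
identifier_vec(toy_data)
# [1] 1 2 3 3 4
```

Note that on the question's data the row with start 10 gets 3 here, not the 1 shown in the requested output: the counter only holds its previous value and does not look up the id of the row whose end matched.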

You could try adapting this answer to group by ranges that are adjacent to each other. Credit goes entirely to @r2evans.

In this case, you would use expand.grid to get combinations of start and end. Instead of labels you would have row numbers rn to reference.

In the end, you can number the groups based on which rows appear together in the list. The last few lines starting with enframe use tibble / tidyverse. To match the group numbers I re-sorted the results too.

I hope this might be helpful.

library(tidyverse)

toy_data <- data.frame(start = c(1,5,6,10,16),
                       end = c(10,9,11,15,17))

toy_data$rn = 1:nrow(toy_data)

eg <- expand.grid(a = seq_len(nrow(toy_data)), b = seq_len(nrow(toy_data)))
eg <- eg[eg$a < eg$b,]

together <- cbind(
  setNames(toy_data[eg$a,], paste0(names(toy_data), "1")),
  setNames(toy_data[eg$b,], paste0(names(toy_data), "2"))
)

together <- subset(together, end1 == start2)

groups <- split(together$rn2, together$rn1)

for (i in toy_data$rn) {
  ind <- (i == names(groups)) | sapply(groups, `%in%`, x = i)
  vals <- groups[ind]
  groups <- c(
    setNames(list(unique(c(i, names(vals), unlist(vals)))), i),
    groups[!ind]
  )
}

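# Smallest row number in each group (compared numerically, because the
# group members were coerced to character in the loop above)
min_row <- sapply(groups, \(g) min(as.numeric(g)))

# Number the groups by their smallest row number, then stack them
lapply(order(min_row), \(x) toy_data[toy_data$rn %in% groups[[x]], ]) %>%
  enframe() %>%
  unnest(cols = value) %>%
  select(-rn)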

Output

   name start   end
  <int> <dbl> <dbl>
1     1     1    10
2     1    10    15
3     2     5     9
4     3     6    11
5     4    16    17
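
For comparison, the chaining rule from the question can also be sketched in base R without building all row pairs. This is only an illustration under stated assumptions: chain_ids is a hypothetical helper, each start is assumed to match at most one end, and the links are assumed to form no cycles (a step limit guards the loop if they do).

```r
# Hypothetical helper: follow each row back to the first row of its chain,
# then number the chains in ascending order of their first start value.
chain_ids <- function(df) {
  n <- nrow(df)
  # parent[i] is the row whose `end` equals row i's `start` (NA if none)
  parent <- match(df$start, df$end)
  root <- seq_len(n)
  for (i in seq_len(n)) {
    steps <- 0L
    # Walk up the chain; the step limit only protects against cycles
    while (!is.na(parent[root[i]]) && steps < n) {
      root[i] <- parent[root[i]]
      steps <- steps + 1L
    }
  }
  # Distinct roots, ordered by the start value of the root row
  ordered_roots <- unique(root[order(df$start[root])])
  match(root, ordered_roots)
}

toy_data <- data.frame(start = c(1, 5, 6, 10, 16),
                       end = c(10, 9, 11, 15, 17))
toy_data$NEW_VAR <- chain_ids(toy_data)
toy_data$NEW_VAR
# [1] 1 2 3 1 4

# Rows arranged as in the requested output
toy_data[order(toy_data$NEW_VAR, toy_data$start), ]
```

The ids are returned in the original row order, so sorting by NEW_VAR and then start is a separate, optional step.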
