Finding hierarchical relationships between pairs of numbers in R

Question

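I want to find an efficient way to determine the entire hierarchy-type relationship for a table of number pairs, and then express that relationship in a vector or string, so that I can determine other useful information about each pair's hierarchy, such as the highest related integer, the lowest related integer, and the total number of related integers.

For example, I have a table of integer pairs:

A record is related to another record if any value in one pair is shared by any value in the other pair. The final table would look like this: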

  X    Y    Related ID's
  ---  ---  ---------------   
  5    10    3,5,10,11,12,13 
  5    11    3,5,10,11,12,13 
  11   12    3,5,10,11,12,13 
  11   13    3,5,10,11,12,13 
  13   3     3,5,10,11,12,13 
  20   18    17,18,20,21,50
  17   18    17,18,20,21,50
  50   18    17,18,20,21,50
  20   21    17,18,20,21,50
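
The input table itself is not reproduced above, but it can be read off the X and Y columns of the result table. A minimal sketch, assuming the pairs live in a data frame named `df` (the name the answers use) and using a hypothetical helper `summarise_ids()` to pull the lowest, highest, and count out of a related-ID string:

```r
# Input pairs, reconstructed from the X and Y columns of the table above.
df <- data.frame(
  X = c(5, 5, 11, 11, 13, 20, 17, 50, 20),
  Y = c(10, 11, 12, 13, 3, 18, 18, 18, 21)
)

# Hypothetical helper (not from the original post): parse a comma-separated
# ID string and report the lowest ID, the highest ID, and how many there are.
summarise_ids <- function(ids) {
  v <- as.numeric(strsplit(ids, ",")[[1]])  # as.numeric() ignores padding spaces
  c(lowest = min(v), highest = max(v), count = length(v))
}

summarise_ids("3,5,10,11,12,13")
```

The same helper should work on the `", "`-separated strings that `toString()` produces in the answers.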

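What I have right now is admittedly a mess. It uses a fuzzy join with a matching function that takes x,y as a vector and matches between them. That match then creates a larger vector of all four matching numbers, which is passed back into fuzzy_join to match again. This loops until there are no more matches. It gets bad very quickly, and at about 4k records it stops responding. The entire initial table of pairs will stay below 100k records.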

Answer 1

In base R you could do:

relation <- function(dat){
  .relation <- function(x){
    k = unique(sort(c(dat[dat[, 1] %in% x, 2], x, dat[dat[, 2] %in% x, 1])))
    if(setequal(x,k)) toString(k) else .relation(k)}
  sapply(dat[,1],.relation)
}

df$related <- relation(df)

df
   X  Y              related
1  5 10 3, 5, 10, 11, 12, 13
2  5 11 3, 5, 10, 11, 12, 13
3 11 12 3, 5, 10, 11, 12, 13
4 11 13 3, 5, 10, 11, 12, 13
5 13  3 3, 5, 10, 11, 12, 13
6 20 18   17, 18, 20, 21, 50
7 17 18   17, 18, 20, 21, 50
8 50 18   17, 18, 20, 21, 50
9 20 21   17, 18, 20, 21, 50

If you have igraph installed, you could do the following:

library(igraph)
a <- components(graph_from_data_frame(df, FALSE))$membership
b <- tapply(names(a),a,toString)
df$related <- b[a[as.character(df$X)]]

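Edit:

If we are going to compare the speed of the functions, note that in my function above the last statement, i.e. sapply(dat[,1], ...), computes the value for every element, even when it has already been computed. For example, sapply(c(5,5,5,5)...) will compute the group 4 times instead of once. Now use: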

relation <- function(dat){
  .relation <- function(x){
    k <- unique(c(dat[dat[, 1] %in% x, 2], x, dat[dat[, 2] %in% x, 1]))
    if(setequal(x,k)) sort(k) else .relation(k)}
  d <- unique(dat[,1])
  m <- setNames(character(length(d)),d)
  while(length(d) > 0){
    s <- .relation(d[1])
    m[as.character(s)] <- toString(s)
    d <- d[!d%in%s]
  }
  dat$groups <- m[as.character(dat[,1])]
  dat
}
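
The redundancy described in the edit is easy to see on a toy example. A rough sketch, with a hypothetical counter `calls` and a stand-in function `slow_group()` (neither is part of the answer), comparing one call per element against one call per distinct value:

```r
calls <- 0

# Stand-in for an expensive per-value computation; counts its invocations.
slow_group <- function(x) {
  calls <<- calls + 1
  x * 2
}

x <- c(5, 5, 5, 5)

# One call per element: the same value is recomputed for every repeat.
calls <- 0
res_all <- sapply(x, slow_group)
calls_all <- calls

# One call per distinct value, then map the results back with match().
calls <- 0
ux <- unique(x)
res_unique <- sapply(ux, slow_group)[match(x, ux)]
calls_unique <- calls

c(per_element = calls_all, per_distinct_value = calls_unique)
```

The revised `relation()` goes one step further than this sketch: it also skips every value already covered by a previously found group.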

Now for the benchmark:

 df1 <- do.call(rbind,rep(list(df),100))
 microbenchmark::microbenchmark(relation(df1), group_pairs(df1),unit = "relative")


 microbenchmark::microbenchmark(relation(df1), group_pairs(df1))
Unit: milliseconds
             expr      min        lq       mean    median       uq      max neval
    relation(df1)   1.0909   1.17175   1.499096   1.27145   1.6580   3.2062   100
 group_pairs(df1) 153.3965 173.54265 199.559206 190.62030 213.7964 424.8309   100

Answer 2

Another option with igraph

library(igraph)
clt <- clusters(graph_from_data_frame(df,directed = FALSE))$membership
within(df, ID <- ave(names(clt),clt,FUN = toString)[match(as.character(X),names(clt))])

which gives

   X  Y                   ID
1  5 10 5, 11, 13, 10, 12, 3
2  5 11 5, 11, 13, 10, 12, 3
3 11 12 5, 11, 13, 10, 12, 3
4 11 13 5, 11, 13, 10, 12, 3
5 13  3 5, 11, 13, 10, 12, 3
6 20 18   20, 17, 50, 18, 21
7 17 18   20, 17, 50, 18, 21
8 50 18   20, 17, 50, 18, 21
9 20 21   20, 17, 50, 18, 21

Answer 3

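This is far less elegant than Onyambu's base R answer, but I benchmarked it as 4 or 5 times faster. It works by assigning each row to a group, adding its contents to the set of all numbers in that group, then finding the next unassigned row that has at least one member in that set. Once there are no more matching rows, it jumps to the next unassigned row.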

group_pairs <- function(df)
{
  df$ID <- numeric(nrow(df))
  ID <- 1
  row <- 1
  current_set <- numeric()
  
  while(any(df$ID == 0))
  {
    
    df$ID[row]  <- ID
    current_set <- unique(c(current_set, df$x[row], df$y[row]))
    nextrows    <- c(which(df$x %in% current_set & df$ID == 0), 
                     which(df$y %in% current_set & df$ID == 0))
    if (length(nextrows) > 0)
    {
      row <- unique(nextrows)[1]
    }
    else
    {
      ID <- ID + 1
      row <- which(df$ID == 0)[1]
      current_set <- numeric()
    }
  }
  
  df$ID <- sapply(split(df[-3], df$ID), 
                  function(i) paste(sort(unique(unlist(i))), collapse = ", "))[df$ID]
  df
}

So you can do:

group_pairs(df)
#>    x  y                   ID
#> 1  5 10 3, 5, 10, 11, 12, 13
#> 2  5 11 3, 5, 10, 11, 12, 13
#> 3 11 12 3, 5, 10, 11, 12, 13
#> 4 11 13 3, 5, 10, 11, 12, 13
#> 5 13  3 3, 5, 10, 11, 12, 13
#> 6 20 18   17, 18, 20, 21, 50
#> 7 17 18   17, 18, 20, 21, 50
#> 8 50 18   17, 18, 20, 21, 50
#> 9 20 21   17, 18, 20, 21, 50

and

microbenchmark::microbenchmark(relation(df), group_pairs(df))
#> Unit: milliseconds
#>             expr      min       lq     mean   median       uq      max neval cld
#>     relation(df) 4.535100 5.027551 5.737164 5.829652 6.256301 7.669001   100   b
#>  group_pairs(df) 1.022502 1.159601 1.398604 1.338501 1.458950 8.903800   100  a

Answer 4

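I think you can also do something like this in the tidyverse alone (I am using a carefully crafted data frame with a few rows added). The strategy is to keep accumulating related_ids. The id here is just a row id with no special purpose; you can drop that step too.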

df <- data.frame(X = c(5,5,11,11,13,20, 17,50, 20, 5, 1, 17),
                 Y = c(10, 11, 12, 13, 3, 18, 18, 18, 21, 13, 2, 50))

library(tidyverse)

df %>% arrange(pmax(X, Y)) %>% 
  mutate(id = row_number()) %>% rowwise() %>%
  mutate(related_ids = list(c(X, Y))) %>% ungroup() %>%
  mutate(related_ids = accumulate(related_ids, ~if(any(.y %in% .x)){union(.x, .y)} else {.y})) %>%
  as.data.frame()
#>     X  Y id          related_ids
#> 1   1  2  1                 1, 2
#> 2   5 10  2                5, 10
#> 3   5 11  3            5, 10, 11
#> 4  11 12  4        5, 10, 11, 12
#> 5  11 13  5    5, 10, 11, 12, 13
#> 6  13  3  6 5, 10, 11, 12, 13, 3
#> 7   5 13  7 5, 10, 11, 12, 13, 3
#> 8  17 18  8               17, 18
#> 9  20 18  9           17, 18, 20
#> 10 20 21 10       17, 18, 20, 21
#> 11 50 18 11   17, 18, 20, 21, 50
#> 12 17 50 12   17, 18, 20, 21, 50

Created on 2021-06-01 by the reprex package (v2.0.0)
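
The accumulation step can be illustrated without the tidyverse. A small sketch using base R's `Reduce(..., accumulate = TRUE)` in place of `purrr::accumulate()`, applied to the first four pairs from the output above; the names `pairs`, `step`, and `running` are illustrative only:

```r
# First four (X, Y) pairs, in the order shown in the output above.
pairs <- list(c(1, 2), c(5, 10), c(5, 11), c(11, 12))

# Merge a pair into the running set when they share a value;
# otherwise start a new running set from the pair itself.
step <- function(acc, pair) {
  if (any(pair %in% acc)) union(acc, pair) else pair
}

# accumulate = TRUE keeps every intermediate running set, one per input pair.
running <- Reduce(step, pairs, accumulate = TRUE)
sapply(running, toString)
```

Each entry is the set accumulated up to that row, which is why the rows in the output above grow one value at a time within a group.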

Answer 5

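Update: I think you could also use the following solution. It is rather verbose, but I think it is quite effective:

library(dplyr)
library(tidyr)  # pivot_longer() used below comes from tidyr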

# First I created an id column to be able to group the observations with any duplicated 
# values
df %>%
  arrange(X, Y) %>%
  mutate(dup = ifelse((X == lag(X, default = 0) | X == lag(Y, default = 0)) |
                        (Y == lag(X, default = 0) | Y == lag(Y, default = 0)) |
                        (X == lag(X, n = 2L, default = 0) | Y == lag(Y, n = 2L, default = 0)) |
                        (X == lag(Y, n = 2L, default = 0) | Y == lag(X, n = 2L, default = 0)) |
                        (X == lag(Y, n = 3L, default = 0) | Y == lag(X, n = 3L, default = 0)) |
                        (X == lag(X, n = 3L, default = 0) | Y == lag(Y, n = 3L, default = 0)), 1, 0)) %>%
  mutate(id = cumsum(dup == 0)) %>%
  select(-dup) -> df1


df1 %>%
  group_by(id) %>%
  pivot_longer(c(X, Y), names_to = "Name", values_to = "Val") %>%
  arrange(Val) %>%
  mutate(dup = Val == lag(Val, default = 10000)) %>%
  filter(!dup) %>%
  mutate(across(Val, ~ paste(.x, collapse = "-"))) %>%
  select(-dup) %>%
  slice(2:n()) %>%
  select(-Name) %>%
  right_join(df1, by = "id") %>%
  group_by(Val, X, Y) %>%
  distinct() %>%
  select(-id) %>%
  relocate(X, Y)

# A tibble: 9 x 3
# Groups:   Val, X, Y [9]
      X     Y Val            
  <int> <int> <chr>          
1     5    10 3-5-10-11-12-13
2     5    11 3-5-10-11-12-13
3    11    12 3-5-10-11-12-13
4    11    13 3-5-10-11-12-13
5    13     3 3-5-10-11-12-13
6    17    18 17-18-20-21-50 
7    20    18 17-18-20-21-50 
8    20    21 17-18-20-21-50 
9    50    18 17-18-20-21-50

I also tried it on the data frame crafted by @AnilGoyal:

# A tibble: 12 x 3
# Groups:   Val, X, Y [12]
       X     Y Val            
   <dbl> <dbl> <chr>          
 1     1     2 1-2            
 2     5    10 3-5-10-11-12-13
 3     5    11 3-5-10-11-12-13
 4     5    13 3-5-10-11-12-13
 5    11    12 3-5-10-11-12-13
 6    11    13 3-5-10-11-12-13
 7    13     3 3-5-10-11-12-13
 8    17    18 17-18-20-21-50 
 9    17    50 17-18-20-21-50 
10    20    18 17-18-20-21-50 
11    20    21 17-18-20-21-50 
12    50    18 17-18-20-21-50

5 answers

Answer 1
7 (accepted) 2020-08-07 16:17:58

Answer 2
2 2020-08-07 17:29:29

Answer 3
1 2020-08-07 16:43:20

Answer 4
1 2021-06-01 13:48:17

Answer 5
1 2021-06-01 16:04:54
