
In data.table, compare rows and assign a group number to each row based on this comparison, without a loop, in R

I have a data.table with several columns, one of which (group_id) is meant to hold the group number of each row. Initially, group_id is filled with the row number. Any column other than group_id may contain NA. The data set is huge, so an explicit loop in R is too slow.

Here is a small example of what my data.table looks like at first:

library(data.table)
group_id = c(1,2,3,4)
type1 = c(1,5,7,5)
type2 = c(3,3,4,NA)
type3 = c(6,7,NA,NA)
df <- data.table(group_id, type1, type2, type3)
df

   group_id type1 type2 type3
1:        1     1     3     6
2:        2     5     3     7
3:        3     7     4    NA
4:        4     5    NA    NA

What I want is to update group_id by comparing each row with the others, column by column: two rows belong to the same group when they hold equal values in the same column. The lower group_id value is always the one kept. For the previous example, the expected result would be:

   group_id type1 type2 type3
1:        1     1     3     6
2:        1     5     3     7
3:        3     7     4    NA
4:        1     5    NA    NA
  • group_id for row 2 is changed to 1 because row 2 shares its type2 value with row 1 (type2 == 3)
  • group_id for row 3 is unchanged
  • group_id for row 4 is changed to 1 because row 4 shares its type1 value with row 2 (type1 == 5), and row 2 is already in group 1
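
To make the comparison rule concrete, here is a minimal base-R sketch. The helper name shares_value is hypothetical (it is not part of data.table), and a plain data.frame stands in for the type columns:

```r
# Hypothetical helper: TRUE when rows i and j hold the same non-NA value
# in at least one of the type columns.
shares_value <- function(types, i, j) {
  a <- unlist(types[i, ])
  b <- unlist(types[j, ])
  # NA == x is NA, so the is.na() guards turn those comparisons into FALSE
  any(!is.na(a) & !is.na(b) & a == b)
}

types <- data.frame(type1 = c(1, 5, 7, 5),
                    type2 = c(3, 3, 4, NA),
                    type3 = c(6, 7, NA, NA))

shares_value(types, 1, 2)  # TRUE: type2 == 3 in both rows
shares_value(types, 2, 4)  # TRUE: type1 == 5 in both rows
shares_value(types, 1, 4)  # FALSE: no column matches directly
```

Rows 1 and 4 never match directly; row 4 reaches group 1 only through row 2, so the grouping has to be applied transitively.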

This is my second question here, so if I made mistakes, don't hesitate to tell me where. Thank you.

Here is a possible approach using igraph.

For each type* column, remove the NAs first. Then, for each unique value within that column, create a network in which every group_id holding that value is joined to every other one (i.e. a full citation graph, built here as undirected).

Then, union all these sub-networks to create clusters, where the group_ids within the same cluster are linked through one or more identical type values.

Next, find the smallest group_id within each cluster.

Finally, look up the cluster that each group_id belongs to.

library(igraph)
cols <- paste0("type", 1:3)
lg <- list()

#for each type column
for (x in cols) {
    lg <- c(lg, DT[!is.na(get(x)), #remove NAs
        {
            #create graph and label vertices
            gix <- unique(group_id)
            cg <- make_full_citation_graph(length(gix), FALSE)
            V(cg)$name <- as.character(gix)
            .(.(cg))
        }, 
        by=x]$V1)
}

#union all subgraphs
ug <- do.call(union, c(lg, list(byname=TRUE)))
#plot(ug)

#find the smallest group_id for each cluster
#(compare ids as numbers; as strings, "10" would sort before "2")
clu <- components(ug)$membership
split(clu, clu) <- lapply(split(clu, clu), function(x) min(as.numeric(names(x))))

#lookup to update the original dataset
DT[, new_gid := clu[as.character(group_id)]]
DT

output:

   group_id type1 type2 type3 new_gid
1:        1     1     3     6       1
2:        2     5     3     7       1
3:        3     7     4    NA       3
4:        4     5    NA    NA       1

data:

library(data.table)
group_id = c(1,2,3,4)
type1 = c(1,5,7,5)
type2 = c(3,3,4,NA)
type3 = c(6,7,NA,NA)
DT <- data.table(group_id, type1, type2, type3)
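
For readers who prefer not to depend on igraph, the same connected-components idea can be sketched in base R with a simple union-find. This is an illustration under assumptions, not the answer's method: the function name group_by_shared_values is hypothetical, and the sketch omits path compression, so it has not been tuned for very large inputs.

```r
# Sketch: connected components via union-find, base R only.
# `id` holds the initial group ids; `types` is a list/data.frame of type columns.
group_by_shared_values <- function(id, types) {
  n <- length(id)
  parent <- seq_len(n)  # every row starts as the root of its own set

  # follow parent pointers up to the root of a row's set
  find_root <- function(i) {
    while (parent[i] != i) i <- parent[i]
    i
  }

  for (col in types) {
    # row positions grouped by value; split() drops rows where col is NA
    for (rows in split(seq_len(n), col)) {
      root <- find_root(rows[1])
      for (r in rows[-1]) {
        other <- find_root(r)
        if (other != root) parent[other] <- root  # merge the two sets
      }
    }
  }

  roots <- vapply(seq_len(n), find_root, integer(1))
  ave(id, roots, FUN = min)  # smallest id within each component
}

types <- data.frame(type1 = c(1, 5, 7, 5),
                    type2 = c(3, 3, 4, NA),
                    type3 = c(6, 7, NA, NA))
res <- group_by_shared_values(c(1, 2, 3, 4), types)
res  # 1 1 3 1
```

Because a data.table is also a list of columns, something like DT[, new_gid := group_by_shared_values(group_id, .SD), .SDcols = cols] should work with cols as defined above, though that call is untested here.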

Edit: igraph is probably overkill here. This Rcpp version should be faster:

library(Rcpp)
cppFunction("
IntegerVector gclu(IntegerVector id, IntegerVector typ1, IntegerVector typ2, IntegerVector typ3) {
    int i, j, sz = id.size();

    for (i=0; i<sz; i++) {
        for (j=0; j<=i; j++) {
            if ((!IntegerVector::is_na(typ1[i]) && !IntegerVector::is_na(typ1[j]) && typ1[i]==typ1[j]) ||
                (!IntegerVector::is_na(typ2[i]) && !IntegerVector::is_na(typ2[j]) && typ2[i]==typ2[j]) ||
                (!IntegerVector::is_na(typ3[i]) && !IntegerVector::is_na(typ3[j]) && typ3[i]==typ3[j])) {

                id[i] = id[j];
                break;
            }
        }
    }

    return(id);
}
")
DT[, gclu(group_id, type1, type2, type3)]
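
To see how this single-pass loop behaves without compiling anything, here is a base-R transliteration. The name gclu_r is hypothetical and the code is for illustration only. Each row takes the id of the first earlier row it matches, so a row that links two previously separate groups does not merge them:

```r
# Base-R transliteration of the loop in gclu(); illustrative only.
gclu_r <- function(id, ...) {
  types <- list(...)
  for (i in seq_along(id)) {
    for (j in seq_len(i)) {
      shared <- FALSE
      for (col in types) {
        if (!is.na(col[i]) && !is.na(col[j]) && col[i] == col[j]) shared <- TRUE
      }
      if (shared) {
        id[i] <- id[j]  # take the id of the first matching row, then stop
        break
      }
    }
  }
  id
}

# The example from the question
ex <- gclu_r(c(1, 2, 3, 4), c(1, 5, 7, 5), c(3, 3, 4, NA), c(6, 7, NA, NA))
ex  # 1 1 3 1

# Row 3 matches row 1 on the first column and row 2 on the second
br <- gclu_r(c(1, 2, 3), c(1, 2, 1), c(NA, 9, 9))
br  # 1 2 1: rows 2 and 3 share a value but keep different ids
```

If the data can contain such bridging rows, a connected-components approach like the igraph one is the safer choice.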

Here is how I managed to group the rows. It might be clumsy, as I'm new to R:

group <- function(table) {
  # Position of the group id column in the list
  group.id <- 1

  # Number of rows
  length.table <- length(table[[1]])

  # Compare every pair of rows: n(n-1)/2 comparisons for n rows, i.e. O(n^2)
  for (i in 1:(length.table - 1)) {
    for (j in (i + 1):length.table) {
      # Go to the next comparison if the two rows are already grouped
      if (table[[group.id]][i] == table[[group.id]][j]) {
        next
      } else {
        # Loop over the type columns (the number of columns, not of rows)
        for (k in (group.id + 1):length(table)) {
          # If the two values are equal (and not NA)
          if (!is.na(table[[k]][i]) &&
              !is.na(table[[k]][j]) &&
              table[[k]][i] == table[[k]][j]) {
            # Then group them with the lesser value of group.id
            if (table[[group.id]][i] < table[[group.id]][j]) {
              table[[group.id]][j] <- table[[group.id]][i]
            } else {
              table[[group.id]][i] <- table[[group.id]][j]
            }
          }
        }
      }
    }
    # If all the rows are already in one group, return the result early
    if (uniqueN(table[[group.id]]) == 1) {
      return(table[[group.id]])
    } 
  }
  return(table[[group.id]])
}

dt[, group.id := group(list(group.id, type1, type2, type3))]

print(dt)

Output:

   group.id type1 type2 type3
1:        1     1     3     6
2:        1     5     3     7
3:        3     7     4    NA
4:        1     5    NA    NA

Data:

library(data.table)
group.id <- c(1, 2, 3, 4)
type1 <- c(1, 5, 7, 5)
type2 <- c(3, 3, 4, NA)
type3 <- c(6, 7, NA, NA)
dt <- data.table(group.id, type1, type2, type3)
