In data.table, compare rows and assign a group number to each row based on this comparison, without a loop in R
I have a data.table with several columns, one of which (group_id) is meant to hold the group number of each row. Initially, group_id is filled with the row number. Any column except group_id may contain NA. My data set is huge, so I can't do this with a loop because it is too slow.
Here is a small example of my data.table at the start:
library(data.table)
group_id = c(1,2,3,4)
type1 = c(1,5,7,5)
type2 = c(3,3,4,NA)
type3 = c(6,7,NA,NA)
df <- data.table(group_id, type1, type2, type3)
df
   group_id type1 type2 type3
1:        1     1     3     6
2:        2     5     3     7
3:        3     7     4    NA
4:        4     5    NA    NA
What I want is to update group_id by comparing each row with the others, column by column: two rows that have the same value in the same column belong to the same group. The lower group_id value is always the one kept. For the previous example, the expected result would be:
   group_id type1 type2 type3
1:        1     1     3     6
2:        1     5     3     7
3:        3     7     4    NA
4:        1     5    NA    NA
This is my second question here, so if I made mistakes, don't hesitate to tell me where. Thank you.
Here is a possible approach using igraph.
For each ^type* column, remove NAs first. Then, for each unique type value within this ^type* column, create a network where each vertex is joined to every other vertex (i.e. a full citation graph).
Then, union all these sub-networks to create clusters where group_ids within the same cluster share one or more identical type values.
Next, find the earliest group_id within each cluster.
Finally, look up the cluster that each group_id is in.
library(igraph)
cols <- paste0("type", 1:3)
lg <- list()
#for each type column
for (x in cols) {
    lg <- c(lg, DT[!is.na(get(x)), #remove NAs
        {
            #create graph and label vertices
            gix <- unique(group_id)
            cg <- make_full_citation_graph(length(gix), FALSE)
            V(cg)$name <- as.character(gix)
            .(.(cg))
        },
        by=x]$V1)
}
#union all subgraphs
ug <- do.call(union, c(lg, list(byname=TRUE)))
#plot(ug)
#find the earliest group_id for each cluster
clu <- components(ug)$membership
#compare ids as numbers: min() on the character names would sort "10" before "2"
split(clu, clu) <- lapply(split(clu, clu), function(x) min(as.numeric(names(x))))
#lookup to update the original dataset
DT[, new_gid := clu[as.character(group_id)]]
DT
output:
   group_id type1 type2 type3 new_gid
1:        1     1     3     6       1
2:        2     5     3     7       1
3:        3     7     4    NA       3
4:        4     5    NA    NA       1
data:
library(data.table)
group_id = c(1,2,3,4)
type1 = c(1,5,7,5)
type2 = c(3,3,4,NA)
type3 = c(6,7,NA,NA)
DT <- data.table(group_id, type1, type2, type3)
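The same idea can also be expressed with an edge list instead of one sub-graph per value. The sketch below is an alternative, not the code above: it reshapes the data with melt, links each row to the smallest group_id that shares the same value in the same column, and lets igraph find the connected components. The names long, edges, memb and comp are illustrative, and it assumes the group_id values are unique.

```r
library(data.table)
library(igraph)

DT <- data.table(group_id = c(1, 2, 3, 4),
                 type1 = c(1, 5, 7, 5),
                 type2 = c(3, 3, 4, NA),
                 type3 = c(6, 7, NA, NA))

# long format: one row per (group_id, column, value), NAs dropped
long <- melt(DT, id.vars = "group_id", na.rm = TRUE)

# link every row to the smallest group_id with the same value in the same column
# (rows with a unique value get a harmless self-loop)
edges <- long[, .(from = min(group_id), to = group_id), by = .(variable, value)]

# vertices are listed explicitly so that every group_id is present in the graph
g <- graph_from_data_frame(as.data.frame(edges[, .(from, to)]),
                           directed = FALSE,
                           vertices = data.frame(name = DT$group_id))

# membership is a vector named by vertex, i.e. by group_id
memb <- components(g)$membership
DT[, comp := memb[as.character(group_id)]]

# keep the lowest group_id within each connected component
DT[, new_gid := min(group_id), by = comp]
DT[, comp := NULL]
DT
```

On the example data this should give the same new_gid column as above (1, 1, 3, 1), since rows 1, 2 and 4 form one component and row 3 stands alone.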
edit: using igraph is probably overkill. This Rcpp version should be faster:
library(Rcpp)
cppFunction("
IntegerVector gclu(IntegerVector id, IntegerVector typ1, IntegerVector typ2, IntegerVector typ3) {
    int i, j, sz = id.size();
    for (i=0; i<sz; i++) {
        for (j=0; j<=i; j++) {
            if ((!IntegerVector::is_na(typ1[i]) && !IntegerVector::is_na(typ1[j]) && typ1[i]==typ1[j]) ||
                (!IntegerVector::is_na(typ2[i]) && !IntegerVector::is_na(typ2[j]) && typ2[i]==typ2[j]) ||
                (!IntegerVector::is_na(typ3[i]) && !IntegerVector::is_na(typ3[j]) && typ3[i]==typ3[j])) {
                id[i] = id[j];
                break;
            }
        }
    }
    return(id);
}
")
DT[, gclu(group_id, type1, type2, type3)]
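One caveat, offered as an observation rather than a benchmark: a single forward pass that stops at the first matching earlier row may leave two groups unmerged when a later row links them (for example, row 3 matching row 1 on one column and row 2 on another). A union-find (disjoint-set) structure handles such chains. The sketch below is plain R for clarity, not a translation of the Rcpp code; group_rows and its arguments are hypothetical names.

```r
# Union-find sketch: rows sharing a non-NA value in the same column are merged,
# then every row receives the smallest id of its merged set.
group_rows <- function(ids, ...) {
  cols <- list(...)
  n <- length(ids)
  parent <- seq_len(n)              # parent[i] == i means row i is a root

  # find the root of row i, shortening the path along the way
  find <- function(i) {
    while (parent[i] != i) {
      parent[i] <<- parent[parent[i]]
      i <- parent[i]
    }
    i
  }

  for (col in cols) {
    # split() drops NA values, so NA never links two rows
    for (rows in split(seq_len(n), col)) {
      if (length(rows) < 2L) next
      root <- find(rows[1L])
      for (r in rows[-1L]) {
        fr <- find(r)
        parent[fr] <- root          # merge the two sets
      }
    }
  }

  roots <- vapply(seq_len(n), find, integer(1))
  ave(ids, roots, FUN = min)        # smallest id within each set
}

# example from the question
res1 <- group_rows(c(1, 2, 3, 4), c(1, 5, 7, 5), c(3, 3, 4, NA), c(6, 7, NA, NA))
res1   # 1 1 3 1

# chained case: row 3 links row 1 (first column) and row 2 (second column)
res2 <- group_rows(1:3, c(1, 2, 1), c(NA, 9, 9))
res2   # 1 1 1
```

Because it takes the type columns through `...`, the same function can be called from data.table as `DT[, new_gid := group_rows(group_id, type1, type2, type3)]`. If speed matters, the same structure could be ported to Rcpp, though that is not shown here.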
Here is how I ended up grouping the rows. It might be clumsy, as I'm new to R coding:
group <- function(table) {
    # Index of the group id column in table
    group.id <- 1
    # Number of rows
    length.table <- length(table[[1]])
    # Number of columns (group id column included)
    n.col <- length(table)
    # Loop over all n(n-1)/2 pairs of rows, with n the number of rows
    for (i in 1:(length.table - 1)) {
        for (j in (i + 1):length.table) {
            # Go to the next comparison if the two rows are already grouped
            if (table[[group.id]][i] == table[[group.id]][j]) {
                next
            } else {
                # Loop over the type columns (not the rows)
                for (k in (group.id + 1):n.col) {
                    # If the two values are equal (and not NA)
                    if (!is.na(table[[k]][i]) &&
                        !is.na(table[[k]][j]) &&
                        table[[k]][i] == table[[k]][j]) {
                        # Then group them with the lesser value of group.id
                        if (table[[group.id]][i] < table[[group.id]][j]) {
                            table[[group.id]][j] <- table[[group.id]][i]
                        } else {
                            table[[group.id]][i] <- table[[group.id]][j]
                        }
                    }
                }
            }
        }
        # If all the rows are grouped then return the result
        if (uniqueN(table[[group.id]]) == 1) {
            return(table[[group.id]])
        }
    }
    return(table[[group.id]])
}
dt[, group.id := group(list(group.id, type1, type2, type3))]
print(dt)
Output:
   group.id type1 type2 type3
1:        1     1     3     6
2:        1     5     3     7
3:        3     7     4    NA
4:        1     5    NA    NA
Data:
library(data.table)
group.id <- c(1, 2, 3, 4)
type1 <- c(1, 5, 7, 5)
type2 <- c(3, 3, 4, NA)
type3 <- c(6, 7, NA, NA)
dt <- data.table(group.id, type1, type2, type3)