I have a data frame that looks like this:
Sn id1 id2 id3
1 abc 123 NA
2 xyz 111 vvv
3 qwe 222 vvv
4 rty NA NA
5 abc NA NA
6 ddd 234 NA
7 sss 222 NA
8 aaa NA NA
Now I want to create a new column 'output' (shown as id4 below) based on the following logic:
First level of relationship: all entities where even a single id matches (NAs do not count) must be assigned the same id.
Second level of relationship: if 2 is connected to 3 and 3 is connected to 7, then 2, 3 and 7 must all have the same id.
Hence the output here would be :
Sn id1 id2 id3 id4
1 abc 123 NA 100001
2 xyz 111 vvv 100002
3 qwe 222 vvv 100002
4 rty NA NA 100003
5 abc NA NA 100001
6 ddd 234 NA 100004
7 sss 222 NA 100002
8 aaa NA NA 100005
Please let me know the easiest way to do this. Any thoughts are welcome.
I am currently thinking of creating an 8 x 8 matrix that holds a flag indicating whether there is any match between two entities (rows).
I like to do tasks like this (with "connected" nodes) using igraph. So if we start with your sample data in a copy/paste-friendly data.frame format:
dd <- data.frame(
  Sn  = 1:8,
  id1 = c("abc", "xyz", "qwe", "rty", "abc", "ddd", "sss", "aaa"),
  id2 = c(123L, 111L, 222L, NA, NA, 234L, 222L, NA),
  id3 = c(NA, "vvv", "vvv", NA, NA, NA, NA, NA),
  stringsAsFactors = FALSE
)
Now the first step is to build an edge list connecting the ids that appear on the same row (id1 to id2, and id2 to id3):
el <- rbind(
  setNames(dd[, 2:3], c("A", "B")),  # id1 - id2 pairs
  setNames(dd[, 3:4], c("A", "B"))   # id2 - id3 pairs
)
el <- el[complete.cases(el), ]  # drop pairs where either id is NA
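The edge list above pairs id1 with id2 and id2 with id3, so a row that has id1 and id3 but a missing id2 would contribute no edge for its id3 value. The sample data has no such row, but as a hedged sketch, the same edge list can be built from every pair of id columns with combn(). The names id_cols and pairs are introduced here for illustration only, and dd is repeated so the snippet runs on its own.

```r
# Sample data, repeated so this snippet runs on its own
dd <- data.frame(
  Sn  = 1:8,
  id1 = c("abc", "xyz", "qwe", "rty", "abc", "ddd", "sss", "aaa"),
  id2 = c(123L, 111L, 222L, NA, NA, 234L, 222L, NA),
  id3 = c(NA, "vvv", "vvv", NA, NA, NA, NA, NA),
  stringsAsFactors = FALSE
)

id_cols <- c("id1", "id2", "id3")               # illustrative name
pairs   <- combn(id_cols, 2, simplify = FALSE)  # every pair of id columns

# One two-column edge table per column pair, stacked into a single edge list
el <- do.call(rbind, lapply(pairs, function(p) {
  data.frame(
    A = as.character(dd[[p[1]]]),
    B = as.character(dd[[p[2]]]),
    stringsAsFactors = FALSE
  )
}))
el <- el[complete.cases(el), ]  # drop pairs where either id is NA
```

The result keeps the same A/B layout, so it can be passed to the graph-building step unchanged.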
And we also need a unique list of all the vertex names:
vx <- na.omit(unique(unlist(dd[, 2:4])))
Now we can create a graph object:
library(igraph)
# graph_from_data_frame() is the current name of graph.data.frame()
gg <- graph_from_data_frame(el, vertices = vx, directed = FALSE)
plot(gg)
We can then use the components() function (called clusters() in older igraph versions) to find the connected groups and get a group number for each vertex:
newid <- data.frame(
  vertex = V(gg)$name,
  grp    = components(gg)$membership
)
Now, to assign that back to the original data.frame, we just need to match on the id1 column, since every row has a non-NA id1:
dd$id4 <- newid$grp[match(dd$id1, newid$vertex)] + 100000
dd
# Sn id1 id2 id3 id4
# 1 1 abc 123 <NA> 100001
# 2 2 xyz 111 vvv 100002
# 3 3 qwe 222 vvv 100002
# 4 4 rty NA <NA> 100003
# 5 5 abc NA <NA> 100001
# 6 6 ddd 234 <NA> 100004
# 7 7 sss 222 <NA> 100002
# 8 8 aaa NA <NA> 100005
and we get the results you desired.
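For comparison, the 8 x 8 matrix idea from the question can be sketched in base R without igraph. This is only a minimal sketch under stated assumptions: two rows are flagged as matching when they share a non-NA value in the same id column (the graph version instead shares vertex names across columns), and the names id_cols, adj and reach are introduced here for illustration. The matrix grows with the square of the row count, so it is only practical for small tables.

```r
# Sample data, repeated so this snippet runs on its own
dd <- data.frame(
  Sn  = 1:8,
  id1 = c("abc", "xyz", "qwe", "rty", "abc", "ddd", "sss", "aaa"),
  id2 = c(123L, 111L, 222L, NA, NA, 234L, 222L, NA),
  id3 = c(NA, "vvv", "vvv", NA, NA, NA, NA, NA),
  stringsAsFactors = FALSE
)

n       <- nrow(dd)
id_cols <- c("id1", "id2", "id3")  # illustrative name

# adj[i, j] is TRUE when rows i and j share a non-NA value in some id column
adj <- diag(n) == 1                # every row matches itself
for (col in id_cols) {
  same <- outer(dd[[col]], dd[[col]], "==")
  same[is.na(same)] <- FALSE       # NAs do not count as a match
  adj <- adj | same
}

# Second-level relationships: extend direct matches to everything reachable
reach <- adj
repeat {
  nxt <- ((reach + 0) %*% (reach + 0)) > 0
  if (all(nxt == reach)) break
  reach <- nxt
}

# Label each group by the first row it contains, then number groups in order
first  <- apply(reach, 1, function(r) which(r)[1])
dd$id4 <- 100000 + match(first, unique(first))
```

On the sample data this assigns the same id4 values as the graph-based approach.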