How to assign a common id based on other columns/id's in R?

I have a dataframe which looks like this :

Sn  id1 id2 id3    
1   abc 123  NA   
2   xyz 111 vvv  
3   qwe 222 vvv    
4   rty  NA  NA    
5   abc  NA  NA    
6   ddd 234  NA   
7   sss 222  NA   
8   aaa  NA  NA

Now I want to create a new column 'output' (shown as id4 below) based on the following logic:

First level of relationship: all entities (rows) that share even a single id (NAs do not count) must be assigned the same id.

Second level of relationship: if 2 is connected to 3 and 3 is connected to 7, then 2, 3 and 7 must all have the same id.

Hence the output here would be :

Sn  id1 id2 id3 id4    
1   abc 123 NA  100001   
2   xyz 111 vvv 100002  
3   qwe 222 vvv 100002  
4   rty NA  NA  100003    
5   abc NA  NA  100001    
6   ddd 234 NA  100004    
7   sss 222 NA  100002   
8   aaa NA  NA  100005

Please let me know what the easiest way to do this is. Any thoughts are welcome.

I am currently thinking of creating an 8×8 matrix containing a flag that indicates whether there is any match between two entities (rows).
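The matrix idea can be sketched in base R roughly as follows. This is only an illustration under two assumptions that are not stated in the question: ids are compared within the same column only, and the helper names (`id_cols`, `adj`, `label`) are made up for this example. The second-level rule is handled by repeatedly giving each row the smallest label among the rows it is linked to, until nothing changes.

```r
# Sample data from the question (NA marks a missing id).
dd <- data.frame(
    Sn  = 1:8,
    id1 = c("abc", "xyz", "qwe", "rty", "abc", "ddd", "sss", "aaa"),
    id2 = c(123L, 111L, 222L, NA, NA, 234L, 222L, NA),
    id3 = c(NA, "vvv", "vvv", NA, NA, NA, NA, NA),
    stringsAsFactors = FALSE
)

id_cols <- c("id1", "id2", "id3")   # hypothetical helper: columns holding ids
n <- nrow(dd)

# adj[i, j] is TRUE when rows i and j share a non-NA value in the same column.
adj <- matrix(FALSE, n, n)
for (col in id_cols) {
    same <- outer(dd[[col]], dd[[col]], "==")   # NA where either value is NA
    adj <- adj | (!is.na(same) & same)
}
diag(adj) <- TRUE   # every row is linked to itself

# Propagate the smallest row label through the links until it stabilises,
# so that indirectly connected rows (2-3 and 3-7) end up with one label.
label <- seq_len(n)
repeat {
    new_label <- vapply(seq_len(n), function(i) min(label[adj[i, ]]), integer(1))
    if (identical(new_label, label)) break
    label <- new_label
}

# Renumber the labels 1, 2, 3, ... in order of first appearance.
dd$id4 <- 100000 + match(label, unique(label))
dd
```

The matrix needs n × n cells, so this sketch is only reasonable for small data frames; a graph-based approach scales better.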

I like to do tasks like this (with "connected" nodes) using igraph. So let's start with your sample data in a copy/paste-friendly data.frame format:

dd <- data.frame(
    Sn = 1:8, 
    id1 = c("abc", "xyz", "qwe", "rty", "abc", "ddd", "sss", "aaa"), 
    id2 = c(123L, 111L, 222L, NA, NA, 234L, 222L, NA), 
    id3 = c(NA, "vvv", "vvv", NA, NA, NA, NA, NA),
    stringsAsFactors=FALSE
)

Now the first step is to build an edge list connecting all the nodes on a given row:

el <- rbind(
    setNames(dd[,2:3], c("A","B")),
    setNames(dd[,3:4], c("A","B"))
)
el <- el[complete.cases(el),]    #(ignore NA)
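One thing to watch with this edge list: it pairs id1 with id2 and id2 with id3, so a row that has id1 and id3 but no id2 contributes no edge at all. That does not happen in the sample data, but the sketch below shows the effect on a small made-up data frame (`toy` is hypothetical) and one possible alternative, pairing id1 with every other column. The alternative assumes id1 is never NA, which holds for the data in the question.

```r
# Hypothetical rows: row 2 has id1 and id3 but no id2.
toy <- data.frame(
    id1 = c("abc", "xyz", "qwe"),
    id2 = c(123L, NA, 222L),
    id3 = c(NA, "vvv", "vvv"),
    stringsAsFactors = FALSE
)

# Pairing adjacent columns (id1-id2, id2-id3), as in the edge list above.
el_adjacent <- rbind(
    setNames(toy[, 1:2], c("A", "B")),
    setNames(toy[, 2:3], c("A", "B"))
)
el_adjacent <- el_adjacent[complete.cases(el_adjacent), ]

# Pairing id1 with every other column (id1-id2, id1-id3) instead.
el_star <- rbind(
    setNames(toy[, c(1, 2)], c("A", "B")),
    setNames(toy[, c(1, 3)], c("A", "B"))
)
el_star <- el_star[complete.cases(el_star), ]

"xyz" %in% unlist(el_adjacent)   # is xyz linked to anything?
"xyz" %in% unlist(el_star)
```

With the adjacent pairing, "xyz" would still exist as a vertex but would stay isolated, so it would get its own group instead of joining "qwe" through "vvv".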

And we also need a unique list of all the vertex names:

vx <- na.omit(unique(unlist(dd[, 2:4])))
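Note that `unlist()` pools the values of all three columns into one character vector, so a value that happens to occur in two different columns becomes a single vertex. If the columns hold unrelated kinds of identifier, that may not be what you want. A possible workaround, sketched here on made-up data (`toy`, `plain` and `tagged` are hypothetical names), is to prefix every value with its column name before building the edge list and the vertex list.

```r
# Hypothetical data: "111" appears in id1 and 111 appears in id2.
toy <- data.frame(
    id1 = c("111", "xyz"),
    id2 = c(222L, 111L),
    stringsAsFactors = FALSE
)

# Without prefixes both columns share one namespace: "111" is one vertex.
plain <- unique(na.omit(unlist(lapply(toy, as.character))))

# With prefixes each column gets its own namespace; NA values stay NA.
tagged_df <- as.data.frame(
    Map(function(x, nm) ifelse(is.na(x), NA_character_, paste0(nm, ":", x)),
        toy, names(toy)),
    stringsAsFactors = FALSE
)
tagged <- unique(na.omit(unlist(tagged_df)))

length(plain)    # distinct vertices when columns are pooled
length(tagged)   # distinct vertices when columns are kept apart
```

Whether pooling or prefixing is right depends on what the ids mean; the question does not say, so treat this as an option rather than a fix.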

Now we can create a graph object

library(igraph)
# graph_from_data_frame() is the current name of graph.data.frame()
gg <- graph_from_data_frame(el, vertices=vx, directed=FALSE)
plot(gg)

We can then use the clusters() function (called components() in newer igraph releases) to find the different groups and get a group number for each vertex:

newid <- data.frame(
    vertex=V(gg)$name, 
    grp=clusters(gg)$membership
)

Now, if we want to assign that back to the original data.frame, we really just need to match on the id1 column.

dd$id4 <- newid$grp[match(dd$id1, newid$vertex)]+100000
dd

#   Sn id1 id2  id3    id4
# 1  1 abc 123 <NA> 100001
# 2  2 xyz 111  vvv 100002
# 3  3 qwe 222  vvv 100002
# 4  4 rty  NA <NA> 100003
# 5  5 abc  NA <NA> 100001
# 6  6 ddd 234 <NA> 100004
# 7  7 sss 222 <NA> 100002
# 8  8 aaa  NA <NA> 100005

and we get the results you desired.
