
Efficient way to compute scores within data.table by group

I have the following data.table and want to calculate, by group ( id ), each member's smallest ( min ) Jaro-Winkler score against all other members of that group. I have a simple nested loop that computes this, but I am looking for a more efficient method.

library(data.table)
# install.packages("stringdist")
library(stringdist)

# Create `data.table`
dt <- data.table(id = c(1,1,2,2,2,3,3,3,3,4,4,4), 
                var = c("a","a","kyle","kyle","kile","rage","page","cage","","asd","fdd","xzx"))

# Add an empty numeric score column (a zero-length RHS such as as.numeric() errors in data.table)
dt[, score := NA_real_]
# Create a unique id within each group         
dt[, uid := sequence(.N), by = id]

dt
#     id  var score uid
#  1:  1    a    NA   1
#  2:  1    a    NA   2
#  3:  2 kyle    NA   1
#  4:  2 kyle    NA   2
#  5:  2 kile    NA   3
#  6:  3 rage    NA   1
#  7:  3 page    NA   2
#  8:  3 cage    NA   3
#  9:  3         NA   4
# 10:  4  asd    NA   1
# 11:  4  fdd    NA   2
# 12:  4  xzx    NA   3
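For reference, this is the distance being minimized, computed for a single pair from group 2 (with stringdist's default prefix weight p = 0, method "jw" reduces to the plain Jaro distance):

```r
library(stringdist)

# Jaro-Winkler distance between two members of group 2
stringdist("kyle", "kile", method = "jw")
# [1] 0.1666667
```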

The current, but slow method:

# Loop over all unique id's
for(i in unique(dt$id)){
   # Loop over each member and compute lowest stringdist 
   for(j in 1:nrow(dt[id == i])){
        dt[id == i & uid == j, "score" := min(stringdist(dt[id == i & uid == j, var], 
                                              dt[id == i & uid != j, var],
                                              method = "jw"))]
    }
}

dt[]
#     id  var     score uid
#  1:  1    a 0.0000000   1
#  2:  1    a 0.0000000   2
#  3:  2 kyle 0.0000000   1
#  4:  2 kyle 0.0000000   2
#  5:  2 kile 0.1666667   3
#  6:  3 rage 0.1666667   1
#  7:  3 page 0.1666667   2
#  8:  3 cage 0.1666667   3
#  9:  3      1.0000000   4
# 10:  4  asd 0.4444444   1
# 11:  4  fdd 0.4444444   2
# 12:  4  xzx 1.0000000   3
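One way to drop the nested loop entirely (a sketch, not from the original post, and it assumes every group has at least two members, as here) is `stringdistmatrix()`, which computes all pairwise distances within a group in one call. Adding `diag(Inf, .N)` puts Inf on the diagonal so each row's minimum is taken over the *other* group members only:

```r
library(data.table)
library(stringdist)

dt <- data.table(id = c(1,1,2,2,2,3,3,3,3,4,4,4),
                 var = c("a","a","kyle","kyle","kile","rage","page","cage","","asd","fdd","xzx"))

# Per group: full pairwise Jaro-Winkler matrix, mask the diagonal
# (self-distance is always 0), then take the row-wise minimum
dt[, score := apply(stringdistmatrix(var, var, method = "jw") + diag(Inf, .N),
                    1, min),
   by = id]
```

This still does O(n²) comparisons per group, but pushes the work into one vectorized call per group instead of one call per row.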

(On second thoughts, this is actually very close to David's comments) A possible approach:

#create combinations of unique var by group then call stringdist once
jw <- dt[, if (uniqueN(var) > 1) transpose(combn(unique(var), 2, simplify=FALSE)), .(id)][,
    dis := stringdist(V1, V2, method = "jw")]

#find the min distance for each word
lu <- rbindlist(list(jw[, .(mdis=min(dis)), .(id, var=V1)], 
    jw[, .(mdis=min(dis)), .(id, var=V2)]))

#update join on the min distance for each word
dt[lu, on=.(var, id), score := mdis]

#for duplicated words, dist is 0
dt[dt[, .I[duplicated(var) | duplicated(var, fromLast=TRUE)], by=.(id)]$V1,
    score := 0]
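For completeness, here are those steps condensed into a self-contained sketch. One small safeguard is added that is not in the original: `lu` is collapsed with an extra `min`, since a word can appear as both V1 and V2 with different minima, and duplicate rows in the join would otherwise let the last match win:

```r
library(data.table)
library(stringdist)

dt <- data.table(id = c(1,1,2,2,2,3,3,3,3,4,4,4),
                 var = c("a","a","kyle","kyle","kile","rage","page","cage","","asd","fdd","xzx"))

# One stringdist() call over all within-group pairs of unique values
jw <- dt[, if (uniqueN(var) > 1) transpose(combn(unique(var), 2, simplify = FALSE)), by = id
        ][, dis := stringdist(V1, V2, method = "jw")]

# Minimum distance per (id, word), looking at both ends of each pair,
# then collapsing in case a word occurs on both ends
lu <- rbindlist(list(jw[, .(mdis = min(dis)), by = .(id, var = V1)],
                     jw[, .(mdis = min(dis)), by = .(id, var = V2)]))
lu <- lu[, .(mdis = min(mdis)), by = .(id, var)]

# Update join, then zero out duplicated words (their nearest neighbour is themselves)
dt[lu, on = .(id, var), score := mdis]
dt[dt[, .I[duplicated(var) | duplicated(var, fromLast = TRUE)], by = id]$V1, score := 0]
```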

Motivation: since stringdist is already built for speed and runs in parallel using OpenMP (per its manual), it is faster to call stringdist once over all pairs than to call it many times, once per group.
