Efficient way to compute scores within data.table by group

I have the following data.table and I want to compute, by group ( id ), the smallest ( min ) Jaro-Winkler distance between each member and all other members of that group. I have a simple nested loop that computes this, but I am looking for a more efficient method.

library(data.table)
# install.packages("stringdist")
library(stringdist)

# Create `data.table`
dt <- data.table(id = c(1,1,2,2,2,3,3,3,3,4,4,4), 
                var = c("a","a","kyle","kyle","kile","rage","page","cage","","asd","fdd","xzx"))

# Add a numeric empty score variable         
dt[, "score" := as.numeric()]       
# Create a unique id within each group         
dt[, uid := sequence(.N), by = id]

dt
#     id  var score uid
#  1:  1    a    NA   1
#  2:  1    a    NA   2
#  3:  2 kyle    NA   1
#  4:  2 kyle    NA   2
#  5:  2 kile    NA   3
#  6:  3 rage    NA   1
#  7:  3 page    NA   2
#  8:  3 cage    NA   3
#  9:  3         NA   4
# 10:  4  asd    NA   1
# 11:  4  fdd    NA   2
# 12:  4  xzx    NA   3

The current, but slow, method:

# Loop over all unique id's
for(i in unique(dt$id)){
   # Loop over each member and compute lowest stringdist 
   for(j in 1:nrow(dt[id == i])){
        dt[id == i & uid == j, "score" := min(stringdist(dt[id == i & uid == j, var], 
                                              dt[id == i & uid != j, var],
                                              method = "jw"))]
    }
}

dt[]
#     id  var     score uid
#  1:  1    a 0.0000000   1
#  2:  1    a 0.0000000   2
#  3:  2 kyle 0.0000000   1
#  4:  2 kyle 0.0000000   2
#  5:  2 kile 0.1666667   3
#  6:  3 rage 0.1666667   1
#  7:  3 page 0.1666667   2
#  8:  3 cage 0.1666667   3
#  9:  3      1.0000000   4
# 10:  4  asd 0.4444444   1
# 11:  4  fdd 0.4444444   2
# 12:  4  xzx 1.0000000   3
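Before reworking the logic entirely, one intermediate improvement worth noting: the inner loop can be replaced by a single `stringdistmatrix()` call per group, taking the row-wise minimum after masking the diagonal (each word's zero distance to itself). A sketch, assuming every group has at least two members (a lone member would end up with `Inf`):

```r
library(data.table)
library(stringdist)

dt <- data.table(id  = c(1,1,2,2,2,3,3,3,3,4,4,4),
                 var = c("a","a","kyle","kyle","kile","rage","page","cage","","asd","fdd","xzx"))

# One pairwise distance matrix per group; the row-wise minimum over the
# off-diagonal entries is the smallest distance to any *other* member.
dt[, score := {
  m <- stringdistmatrix(var, var, method = "jw")
  diag(m) <- Inf   # exclude each word's distance to itself
  apply(m, 1, min)
}, by = id]
```

This still calls `stringdist` once per group rather than once overall, but it removes the row-by-row subsetting that dominates the nested loop's cost.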

(On second thoughts, this is actually very close to David's comments.) A possible approach:

#create combinations of unique var by group then call stringdist once
jw <- dt[, if (uniqueN(var) > 1) transpose(combn(unique(var), 2, simplify = FALSE)), .(id)][,
    dis := stringdist(V1, V2, method = "jw")]

#find the min distance for each word
lu <- rbindlist(list(jw[, .(mdis=min(dis)), .(id, var=V1)], 
    jw[, .(mdis=min(dis)), .(id, var=V2)]))

#update join on the min distance for each word
dt[lu, on=.(var, id), score := mdis]

#for duplicated words, dist is 0
dt[dt[, .I[duplicated(var) | duplicated(var, fromLast=TRUE)], by=.(id)]$V1,
    score := 0]
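For reference, a self-contained check (a sketch using the toy data above; `expected` holds the scores the nested loop produced) that this vectorised approach reproduces the loop's result:

```r
library(data.table)
library(stringdist)

dt <- data.table(id  = c(1,1,2,2,2,3,3,3,3,4,4,4),
                 var = c("a","a","kyle","kyle","kile","rage","page","cage","","asd","fdd","xzx"))
expected <- c(0, 0, 0, 0, 1/6, 1/6, 1/6, 1/6, 1, 4/9, 4/9, 1)

# Pairwise combinations of unique words per group, one stringdist call
jw <- dt[, if (uniqueN(var) > 1) transpose(combn(unique(var), 2, simplify = FALSE)), .(id)][,
    dis := stringdist(V1, V2, method = "jw")]

# Minimum distance per word, looked at from both sides of each pair
lu <- rbindlist(list(jw[, .(mdis = min(dis)), .(id, var = V1)],
                     jw[, .(mdis = min(dis)), .(id, var = V2)]))

dt[lu, on = .(var, id), score := mdis]
dt[dt[, .I[duplicated(var) | duplicated(var, fromLast = TRUE)], by = .(id)]$V1,
   score := 0]

all.equal(dt$score, expected)
```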

Motivation: since stringdist is already built for speed and runs in parallel using OpenMP (per its manual), it will be faster to call stringdist once over all pairs rather than many times by group.
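That claim is easy to check yourself. A base-R timing sketch (synthetic words and group labels, invented here purely for illustration) comparing one vectorised call against the same work split into per-group calls:

```r
library(stringdist)

set.seed(1)
# Synthetic data: 2000 random 5-letter words spread over 100 groups
words <- replicate(2000, paste(sample(letters, 5, TRUE), collapse = ""))
grp   <- sample(rep(1:100, 20))
other <- rev(words)

# One vectorised call over all element-wise pairs at once
t_one <- system.time(d1 <- stringdist(words, other, method = "jw"))

# The same distances computed via one small call per group
t_grp <- system.time({
  d2 <- numeric(length(words))
  for (g in unique(grp)) {
    i <- which(grp == g)
    d2[i] <- stringdist(words[i], other[i], method = "jw")
  }
})

all.equal(d1, d2)   # identical distances either way
t_one; t_grp        # the single call avoids 100x call/dispatch overhead
```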
