Efficient way to compute scores within data.table by group
I have the following data.table and I am looking to calculate, by group (id), the smallest (min) Jaro-Winkler score against all other members of that group. I have a simple nested loop that can compute this, but I am looking for a more efficient method.
library(data.table)
# install.packages("stringdist")
library(stringdist)
# Create `data.table`
dt <- data.table(id  = c(1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4),
                 var = c("a", "a", "kyle", "kyle", "kile", "rage",
                         "page", "cage", "", "asd", "fdd", "xzx"))
# Add an empty numeric score column
dt[, score := NA_real_]
# Create a unique id within each group
dt[, uid := sequence(.N), by = id]
dt
#     id  var score uid
#  1:  1    a    NA   1
#  2:  1    a    NA   2
#  3:  2 kyle    NA   1
#  4:  2 kyle    NA   2
#  5:  2 kile    NA   3
#  6:  3 rage    NA   1
#  7:  3 page    NA   2
#  8:  3 cage    NA   3
#  9:  3         NA   4
# 10:  4  asd    NA   1
# 11:  4  fdd    NA   2
# 12:  4  xzx    NA   3
The current, but slow, method:
# Loop over all unique ids
for (i in unique(dt$id)) {
  # Loop over each member and compute the lowest string distance
  # to every other member of the same group
  for (j in 1:nrow(dt[id == i])) {
    dt[id == i & uid == j,
       score := min(stringdist(dt[id == i & uid == j, var],
                               dt[id == i & uid != j, var],
                               method = "jw"))]
  }
}
dt[]
#     id  var     score uid
#  1:  1    a 0.0000000   1
#  2:  1    a 0.0000000   2
#  3:  2 kyle 0.0000000   1
#  4:  2 kyle 0.0000000   2
#  5:  2 kile 0.1666667   3
#  6:  3 rage 0.1666667   1
#  7:  3 page 0.1666667   2
#  8:  3 cage 0.1666667   3
#  9:  3      1.0000000   4
# 10:  4  asd 0.4444444   1
# 11:  4  fdd 0.4444444   2
# 12:  4  xzx 1.0000000   3
(On second thoughts, this is actually very close to David's comments.) A possible approach:
# Create combinations of unique var by group, then call stringdist once
jw <- dt[, if (uniqueN(var) > 1) transpose(combn(unique(var), 2, simplify = FALSE)), .(id)][,
  dis := stringdist(V1, V2, method = "jw")]

# Find the min distance for each word
lu <- rbindlist(list(jw[, .(mdis = min(dis)), .(id, var = V1)],
                     jw[, .(mdis = min(dis)), .(id, var = V2)]))

# Update join on the min distance for each word
dt[lu, on = .(var, id), score := mdis]

# For duplicated words within a group, the distance is 0
dt[dt[, .I[duplicated(var) | duplicated(var, fromLast = TRUE)], by = .(id)]$V1,
   score := 0]
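To see what the combn/transpose step produces on its own, it can be run for a single group; for group 3 the four unique strings yield choose(4, 2) = 6 unordered pairs (a standalone sketch, not part of the answer's pipeline):

```r
library(data.table)

# The unique var values of group 3 in the example data
vars <- c("rage", "page", "cage", "")

# combn() returns a list of 6 length-2 vectors (one per pair);
# transpose() flips this into 2 length-6 vectors, which become
# the V1 and V2 columns of the jw table above
pairs <- transpose(combn(vars, 2, simplify = FALSE))

length(pairs)       # 2 columns: V1 and V2
length(pairs[[1]])  # 6 pairs
```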
Motivation: since stringdist is already built for speed and runs in parallel using OpenMP (per the manual), it will be faster to run stringdist once rather than multiple times by group.
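For comparison, the same result can also be obtained with stringdistmatrix() applied per group. This still calls the C code once per group rather than once overall, so it gives up some of the single-call advantage described above, but it replaces the nested loop with one vectorized expression. A sketch (my own variant, not from the answer above):

```r
library(data.table)
library(stringdist)

dt <- data.table(id  = c(1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4),
                 var = c("a", "a", "kyle", "kyle", "kile", "rage",
                         "page", "cage", "", "asd", "fdd", "xzx"))

# For each group, build the full pairwise Jaro-Winkler distance matrix,
# mask the diagonal (self-comparisons) with Inf, and take the row-wise min
dt[, score := {
  m <- stringdistmatrix(var, var, method = "jw")
  diag(m) <- Inf
  apply(m, 1, min)
}, by = id]
```

Duplicated words come out as 0 automatically here, since the off-diagonal entry for a duplicate pair is 0, so the separate fix-up step for duplicates is not needed.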