In R - fastest way to compare character strings pairwise on similarity
I'm looking for a way to speed up the following approach. Any pointers are very welcome. Where are the bottlenecks?
Say I have the following data.frame:
df <- data.frame(names=c("A ADAM", "S BEAN", "A APPLE", "J BOND", "J BOND"),
v1=c("Test_a", "Test_b", "Test_a", "Test_b", "Test_b"),
v2=c("Test_c", "Test_c", "Test_d", "Test_d", "Test_d"))
I want to compare each pair of rows in df on their Jaro-Winkler similarity.
With some help from others (see this post), I've been able to construct this code:
library(stringdist)

# columns to compare
testCols <- c("names", "v1", "v2")

# compare every pair of rows
RowCompare <- function(x) {
  comp <- NULL
  pairs <- t(combn(nrow(x), 2))   # one row per (row_a, row_b) pair
  for (i in 1:nrow(pairs)) {
    row_a <- pairs[i, 1]
    row_b <- pairs[i, 2]
    a_tests <- x[row_a, testCols]
    b_tests <- x[row_b, testCols]
    comp <- rbind(comp, c(row_a, row_b, TestsCompare(a_tests, b_tests)))
  }
  colnames(comp) <- c("row_a", "row_b", "names_j", "v1_j", "v2_j")
  return(comp)
}

# define TestsCompare: Jaro-Winkler distance for each compared column
TestsCompare <- function(x, y) {
  names_j <- stringdist(x$names, y$names, method = "jw")
  v1_j <- stringdist(x$v1, y$v1, method = "jw")
  v2_j <- stringdist(x$v2, y$v2, method = "jw")
  c(names_j, v1_j, v2_j)
}
This generates the correct output:
output = as.data.frame(RowCompare(df))
> output
row_a row_b names_j v1_j v2_j
1 1 2 0.4444444 0.1111111 0.0000000
2 1 3 0.3571429 0.0000000 0.1111111
3 1 4 0.4444444 0.1111111 0.1111111
4 1 5 0.4444444 0.1111111 0.1111111
5 2 3 0.4603175 0.1111111 0.1111111
6 2 4 0.3333333 0.0000000 0.1111111
7 2 5 0.3333333 0.0000000 0.1111111
8 3 4 0.5634921 0.1111111 0.0000000
9 3 5 0.5634921 0.1111111 0.0000000
10 4 5 0.0000000 0.0000000 0.0000000
However, my real data.frame has 8 million observations and I make 17 comparisons. Running this code takes days... I am looking for ways to speed up this process.
If you iterate over the variables you want to check, you can make a distance matrix for each with stringdist::stringdistmatrix. Using a form of lapply or purrr::map will return a list of distance matrices (one for each column), which you can in turn iterate over to call broom::tidy, which will turn them into nicely formatted data.frames. If you use purrr::map_df and its .id parameter, the results will be coerced into one big data.frame, and the name of each list element will be added as a new column so you can keep them straight. The resulting data.frame will be in long form, so if you want it to match the results above, reshape with tidyr::spread.
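To make the per-column idea concrete without the tidyverse, here is a minimal base-R sketch that uses only stringdist, applying Jaro-Winkler to every column. It assumes the example df from the question and relies on a "dist" object storing its values in the same pair order that combn() produces; the object names (dist_list, pairs, wide) are just illustrative.

```r
library(stringdist)

df <- data.frame(
  names = c("A ADAM", "S BEAN", "A APPLE", "J BOND", "J BOND"),
  v1    = c("Test_a", "Test_b", "Test_a", "Test_b", "Test_b"),
  v2    = c("Test_c", "Test_c", "Test_d", "Test_d", "Test_d"),
  stringsAsFactors = FALSE
)

# One "dist" object per column; called with a single vector,
# stringdistmatrix() compares all of its elements pairwise
dist_list <- lapply(df, stringdistmatrix, method = "jw")

# Row-index pairs, in the order a "dist" object stores its values:
# (1,2), (1,3), ..., (1,n), (2,3), ...
pairs <- t(combn(nrow(df), 2))

# Flatten each "dist" object to a plain numeric vector and bind the
# vectors next to the row indices (one column per compared variable)
wide <- cbind(
  data.frame(row_a = pairs[, 1], row_b = pairs[, 2]),
  as.data.frame(lapply(dist_list, as.numeric))
)

wide
```

This should give one row per pair, with row_a and row_b playing the same role as in the question's output, and it avoids growing a matrix with rbind inside a loop.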
If, as you mentioned in the comments, you want to use different methods for different variables, iterate in parallel with map2 or Map.
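For illustration, the base-R Map variant might look like the sketch below. The methods vector (soundex for names, Jaro-Winkler for the rest) is only an example choice, and the data is the example df from the question.

```r
library(stringdist)

df <- data.frame(
  names = c("A ADAM", "S BEAN", "A APPLE", "J BOND", "J BOND"),
  v1    = c("Test_a", "Test_b", "Test_a", "Test_b", "Test_b"),
  v2    = c("Test_c", "Test_c", "Test_d", "Test_d", "Test_d"),
  stringsAsFactors = FALSE
)

# One method per column, in the same order as the columns of df
# (an illustrative choice, not a recommendation)
methods <- c("soundex", "jw", "jw")

# Map() walks over the columns and the methods in parallel and
# returns a list of "dist" objects named after the columns
dist_list <- Map(
  function(column, method) stringdistmatrix(column, method = method),
  df, methods
)

names(dist_list)
lengths(dist_list)
```

Each list element holds choose(nrow(df), 2) distances, computed with the method paired with that column.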
Altogether,
library(tidyverse)
map2(df, c('soundex', 'jw', 'jw'), ~stringdist::stringdistmatrix(.x, method = .y)) %>%
map_df(broom::tidy, .id = 'var') %>%
spread(var, distance)
## item1 item2 names v1 v2
## 1 2 1 1 0.1111111 0.0000000
## 2 3 1 1 0.0000000 0.1111111
## 3 3 2 1 0.1111111 0.1111111
## 4 4 1 1 0.1111111 0.1111111
## 5 4 2 1 0.0000000 0.1111111
## 6 4 3 1 0.1111111 0.0000000
## 7 5 1 1 0.1111111 0.1111111
## 8 5 2 1 0.0000000 0.1111111
## 9 5 3 1 0.1111111 0.0000000
## 10 5 4 0 0.0000000 0.0000000
Note that while choose(5, 2) returns 10 observations, choose(8000000, 2) returns 3.2e+13 (32 trillion) observations. For practical purposes, even though this will run much more quickly than your existing code (and stringdistmatrix does some parallelization when possible), the data will get prohibitively big unless you are only working on subsets.
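A quick back-of-the-envelope calculation shows the scale. It assumes each distance is stored as an 8-byte double and ignores the index columns; the 17 comparisons come from the question.

```r
n_rows  <- 8e6                   # observations in the real data.frame
n_pairs <- choose(n_rows, 2)     # number of unordered row pairs
n_comps <- 17                    # comparisons per pair (from the question)

# Storage for the distances alone, at 8 bytes per double
bytes_needed <- n_pairs * n_comps * 8

n_pairs                # roughly 3.2e13 pairs
bytes_needed / 1e15    # size in petabytes
```

Under these assumptions the distances alone would take several petabytes, so the full pairwise result cannot be held in memory or on ordinary disk however fast each comparison is.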