In R: fastest way to compare character strings pairwise on similarity

I'm looking for a way to speed up the following approach. Any pointers are very welcome. Where are the bottlenecks?

Say I have the following data.frame:

df <- data.frame(names=c("A ADAM", "S BEAN", "A APPLE", "J BOND", "J BOND"), 
                      v1=c("Test_a", "Test_b", "Test_a", "Test_b", "Test_b"), 
                      v2=c("Test_c", "Test_c", "Test_d", "Test_d", "Test_d"))

I want to compare each pair of rows in df on their Jaro-Winkler similarity.

With some help from others (see this post), I've been able to construct this code:

library(stringdist)

# columns to compare
testCols <- c("names", "v1", "v2")

# compare every pair of rows, one pair per loop iteration
RowCompare <- function(x) {
  comp <- NULL
  pairs <- t(combn(nrow(x), 2))
  for (i in 1:nrow(pairs)) {
    row_a <- pairs[i, 1]
    row_b <- pairs[i, 2]
    a_tests <- x[row_a, testCols]
    b_tests <- x[row_b, testCols]
    # growing comp with rbind() copies the whole matrix on every iteration
    comp <- rbind(comp, c(row_a, row_b, TestsCompare(a_tests, b_tests)))
  }

  colnames(comp) <- c("row_a", "row_b", "names_j", "v1_j", "v2_j")
  return(comp)
}

# define TestsCompare: Jaro-Winkler distance for each compared column
TestsCompare <- function(x, y) {
  names_j <- stringdist(x$names, y$names, method = "jw")
  v1_j <- stringdist(x$v1, y$v1, method = "jw")
  v2_j <- stringdist(x$v2, y$v2, method = "jw")
  c(names_j, v1_j, v2_j)
}

This generates the correct output (stringdist returns distances, so 0 means the two strings are identical):

output = as.data.frame(RowCompare(df))

> output
   row_a row_b   names_j      v1_j      v2_j
1      1     2 0.4444444 0.1111111 0.0000000
2      1     3 0.3571429 0.0000000 0.1111111
3      1     4 0.4444444 0.1111111 0.1111111
4      1     5 0.4444444 0.1111111 0.1111111  
5      2     3 0.4603175 0.1111111 0.1111111
6      2     4 0.3333333 0.0000000 0.1111111
7      2     5 0.3333333 0.0000000 0.1111111
8      3     4 0.5634921 0.1111111 0.0000000
9      3     5 0.5634921 0.1111111 0.0000000
10     4     5 0.0000000 0.0000000 0.0000000

However, my real data.frame has 8 million observations and I make 17 comparisons. Running this code takes days...

I am looking for ways to speed up this process:

  • Should I use matrices instead of data.frames?
  • How to parallelize this process?
  • Vectorize?

If you iterate over the variables you want to check, you can make a distance matrix for each with stringdist::stringdistmatrix. Using a form of lapply or purrr::map will return a list of distance matrices (one for each column), which you can in turn iterate over to call broom::tidy, which will turn them into nicely formatted data.frames. If you use purrr::map_df and its .id parameter, the results will be coerced into one big data.frame, and the name of each list element will be added as a new column so you can tell them apart. The resulting data.frame will be in long form, so if you want it to match the results above, reshape it with tidyr::spread.

If, as you mentioned in the comments, you want to use different methods for different variables, iterate in parallel with map2 or Map.

Altogether:

library(tidyverse)

map2(df, c('soundex', 'jw', 'jw'), ~stringdist::stringdistmatrix(.x, method = .y)) %>% 
    map_df(broom::tidy, .id = 'var') %>% 
    spread(var, distance)

##    item1 item2 names        v1        v2
## 1      2     1     1 0.1111111 0.0000000
## 2      3     1     1 0.0000000 0.1111111
## 3      3     2     1 0.1111111 0.1111111
## 4      4     1     1 0.1111111 0.1111111
## 5      4     2     1 0.0000000 0.1111111
## 6      4     3     1 0.1111111 0.0000000
## 7      5     1     1 0.1111111 0.1111111
## 8      5     2     1 0.0000000 0.1111111
## 9      5     3     1 0.1111111 0.0000000
## 10     5     4     0 0.0000000 0.0000000
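The same idea can also be sketched without the tidyverse. The version below is a minimal illustration, not the answerer's code: it assumes Jaro-Winkler for every column (as in the question) and relies on stringdist() being vectorized over two equal-length character vectors, so each column needs one call instead of one call per pair. The helper names pairs, dists and output are chosen here for illustration.

```r
library(stringdist)

df <- data.frame(
  names = c("A ADAM", "S BEAN", "A APPLE", "J BOND", "J BOND"),
  v1    = c("Test_a", "Test_b", "Test_a", "Test_b", "Test_b"),
  v2    = c("Test_c", "Test_c", "Test_d", "Test_d", "Test_d"),
  stringsAsFactors = FALSE
)

# All unordered row pairs, one pair per row (same order as the loop in the question).
pairs <- t(combn(nrow(df), 2))

# One vectorized stringdist() call per column instead of one call per pair.
dists <- lapply(df, function(col) {
  stringdist(col[pairs[, 1]], col[pairs[, 2]], method = "jw")
})
names(dists) <- paste0(names(df), "_j")

# Assemble the result once, rather than growing it with rbind() inside a loop.
output <- cbind(
  data.frame(row_a = pairs[, 1], row_b = pairs[, 2]),
  as.data.frame(dists)
)
```

On the five-row example this should give the same ten rows and columns as the output shown in the question. To use a different method per column, the lapply call could be swapped for Map over the columns and a vector of method names.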

Note that while choose(5, 2) is only 10 pairs, choose(8000000, 2) is about 3.2e+13 (32 trillion) pairs. So for practical purposes, even though this will run much more quickly than your existing code (and stringdistmatrix does some parallelization when possible), the data will get prohibitively big unless you are only working on subsets.
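To put the size in perspective, 3.2e+13 distances stored as 8-byte doubles come to roughly 256 TB per compared column. One way to work on subsets is blocking: only compare rows that share some key. The sketch below is an assumption-laden illustration, not part of the original answer. The choice of v1 as the blocking key is hypothetical and only makes sense if rows with different keys never need to be compared.

```r
library(stringdist)

df <- data.frame(
  names = c("A ADAM", "S BEAN", "A APPLE", "J BOND", "J BOND"),
  v1    = c("Test_a", "Test_b", "Test_a", "Test_b", "Test_b"),
  v2    = c("Test_c", "Test_c", "Test_d", "Test_d", "Test_d"),
  stringsAsFactors = FALSE
)

# Hypothetical blocking key: only rows with the same v1 value are compared.
block <- df$v1

# Row indices per block; a block with a single row has no pairs, so drop it.
groups <- unname(split(seq_len(nrow(df)), block))
groups <- groups[lengths(groups) > 1]

# Pairs within each block only, stacked into one two-column matrix.
pairs <- do.call(rbind, lapply(groups, function(idx) t(combn(idx, 2))))

# Vectorized Jaro-Winkler distance on the names column for the surviving pairs.
blocked <- data.frame(
  row_a   = pairs[, 1],
  row_b   = pairs[, 2],
  names_j = stringdist(df$names[pairs[, 1]], df$names[pairs[, 2]], method = "jw")
)
```

Here blocking cuts the 10 possible pairs down to 4. On real data the saving depends entirely on how evenly the key splits the rows, and any true match whose key differs is missed, so the key has to be chosen with the data in mind.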
