Slow loop in R: how can I make it faster?

Question

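I have a list of emails, and I want to compare the pattern (similarity) between rows using the longest common substring.

The data is a data frame of emails: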

           V1
1   "01003@163.com"
2   "cloud@coldmail.com"
3   "den_smukk_kiilar@hotmail.com"
4   "Esteban.verduzco@gmail.com"
5   "freiheitmensch@gmail.com"
6   "mitsoanastos@yahoo.com"
7   "ahmedsir744@yahoo.com" 
8   ...

This is my code:

library(stringdist)

for(i in 1:nrow(data)) {
      sample <- data[i,]
      for(j in (i+1):nrow(data)) if(i+1 <= nrow(data)) {
        if((stringdist(data[j,],sample,method='lcs'))<=3) {  # number of different characters: 3 (123.456 == 123.321)
          duplicate <- data[j,]
          email1 = as.character(data[i,])
          email2 = as.character(data[j,])
          pair <- cbind(email1, email2)
          output3[dfrow, ] <- pair
          dfrow <- dfrow + 1
        }
      }
    }

The output (`output3`) is a data frame showing the similar emails:

         email1          email2
1   "01079@163.com" "01069@163.com"
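
To make the threshold concrete, here is a small check of what `method = 'lcs'` returns. According to the stringdist documentation, the lcs distance is the number of characters left unpaired after matching the two strings in order, so a single substituted character counts as 2 (one deletion plus one insertion), not 1. The two pairs below are taken from the example output and from the code comment above; the values in the comments follow from that definition.

```r
library(stringdist)

# The pair from the example output: one character differs (7 vs 6).
d_emails <- stringdist("01079@163.com", "01069@163.com", method = "lcs")
d_emails   # 2

# The pair from the code comment: three characters differ (456 vs 321).
d_comment <- stringdist("123.456", "123.321", method = "lcs")
d_comment  # 6
```

So with a cutoff of `<= 3`, a pair like 123.456 / 123.321 would not be reported; the cutoff allows at most one substitution plus one inserted or deleted character.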

I have 300,000 emails, so this will take forever...

Is there a better way?

Thanks!

Answer 1

Here is an attempt:

library(stringdist)
library(stringi)
library(dplyr)
library(tidyr)

# Hypothetical data frame     
data <- data.frame(V1 = paste0(stri_rand_strings(5, 3, "[a-z]"), 
                               "@", stri_rand_strings(5, 2, "[a-z]"), ".com"), 
                   stringsAsFactors = FALSE)

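Basically, you build a pairwise string-distance matrix, wrap it in a data frame, and replace every string distance less than or equal to 3 (excluding the zero self-distances) with the corresponding V1 value and everything else with NA. Then you drop the V1 column, which is no longer needed, gather() the data into tidy format, and remove the NAs.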

data %>%
  data.frame(stringdistmatrix(.$V1, .$V1, useNames = TRUE, method = "lcs"), 
             row.names = NULL) %>%

#          V1 wnw.fa.com kty.hm.com brs.wk.com pib.uo.com ryu.iq.com
#1 wnw@fa.com          0         10         10         10         10
#2 kty@hm.com         10          0         10         10          8
#3 brs@wk.com         10         10          0          8          8
#4 pib@uo.com         10         10          8          0         10
#5 ryu@iq.com         10          8          8         10          0

  # here you need to replace '8' by '3' for your example
  mutate_each(funs(ifelse(. <= 8 & . != 0, V1, NA)), -V1) %>% 

#          V1 wnw.fa.com kty.hm.com brs.wk.com pib.uo.com ryu.iq.com
#1 wnw@fa.com         NA       <NA>       <NA>       <NA>       <NA>
#2 kty@hm.com         NA       <NA>       <NA>       <NA> kty@hm.com
#3 brs@wk.com         NA       <NA>       <NA> brs@wk.com brs@wk.com
#4 pib@uo.com         NA       <NA> pib@uo.com       <NA>       <NA>
#5 ryu@iq.com         NA ryu@iq.com ryu@iq.com       <NA>       <NA>

  select(-V1) %>%
  gather(email1, email2) %>%
  na.omit() %>%
  mutate(email1 = stri_replace_first(email1, fixed = ".", "@"))

This gives:

#      email1     email2
#1 kty@hm.com ryu@iq.com
#2 brs@wk.com pib@uo.com
#3 brs@wk.com ryu@iq.com
#4 pib@uo.com brs@wk.com
#5 ryu@iq.com kty@hm.com
#6 ryu@iq.com brs@wk.com
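
Two caveats when scaling this up. First, a full 300,000 x 300,000 distance matrix has 9 x 10^10 cells (roughly 720 GB as doubles), so it will not fit in memory. Second, `data.frame()` rewrites the `@` in the column names to `.`, and restoring it by replacing the first `.` goes wrong for addresses that contain a dot before the `@` (such as Esteban.verduzco@gmail.com). The following is a minimal sketch, not a tuned solution, that works on blocks of rows and looks pairs up by index instead of by column name. `similar_pairs` and `chunk_size` are hypothetical names introduced here; only `stringdistmatrix()` comes from the stringdist package.

```r
library(stringdist)

# Hypothetical helper: return every pair (i < j) whose lcs distance is <= max_dist,
# computing the distance matrix one block of rows at a time to bound memory use.
similar_pairs <- function(emails, max_dist = 3, chunk_size = 1000) {
  n <- length(emails)
  empty <- data.frame(email1 = character(0), email2 = character(0),
                      stringsAsFactors = FALSE)
  if (n < 2) return(empty)

  starts <- seq(1, n, by = chunk_size)
  chunks <- lapply(starts, function(s) {
    rows <- s:min(s + chunk_size - 1, n)
    # Only compare against emails from position s onwards; earlier ones
    # were already paired with these rows in a previous block.
    d <- stringdistmatrix(emails[rows], emails[s:n], method = "lcs")
    hits <- which(d <= max_dist, arr.ind = TRUE)
    i <- rows[hits[, "row"]]        # index of the first email in the full vector
    j <- hits[, "col"] + s - 1L     # index of the second email in the full vector
    keep <- i < j                   # drop self-matches and mirrored duplicates
    data.frame(email1 = emails[i[keep]], email2 = emails[j[keep]],
               stringsAsFactors = FALSE)
  })
  do.call(rbind, chunks)
}

# Usage on a few addresses from the question (tiny chunk size just for illustration)
emails <- c("01079@163.com", "01069@163.com",
            "cloud@coldmail.com", "Esteban.verduzco@gmail.com")
res <- similar_pairs(emails, max_dist = 3, chunk_size = 2)
res
```

Each pair is reported once rather than in both directions. Chunking only bounds memory; the number of comparisons is still quadratic. Since the lcs distance can never be smaller than the difference in string lengths, one possible further saving is to skip pairs whose lengths differ by more than the cutoff.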

Answer 1
3 2015-05-19 03:54:24