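Slow loop in R: how can I make it faster?
I have a list of emails, and I want to compare rows for similar patterns using the longest common substring distance.
The data is a data frame of emails: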
V1
1 "01003@163.com"
2 "cloud@coldmail.com"
3 "den_smukk_kiilar@hotmail.com"
4 "Esteban.verduzco@gmail.com"
5 "freiheitmensch@gmail.com"
6 "mitsoanastos@yahoo.com"
7 "ahmedsir744@yahoo.com"
8 ...
Here is my code:
library(stringdist)
for (i in 1:nrow(data)) {
  sample <- data[i, ]
  for (j in (i + 1):nrow(data)) if (i + 1 <= nrow(data)) {
    # keep pairs that differ by at most 3 characters (e.g. 123.456 vs 123.321)
    if (stringdist(data[j, ], sample, method = 'lcs') <= 3) {
      duplicate <- data[j, ]
      email1 = as.character(data[i, ])
      email2 = as.character(data[j, ])
      pair <- cbind(email1, email2)
      output3[dfrow, ] <- pair
      dfrow <- dfrow + 1
    }
  }
}
"output3" is the data frame that holds the pairs of similar emails:
email1 email2
1 "01079@163.com" "01069@163.com"
I have 300,000 emails, so this will take forever...
Is there a better way?
Thanks!
Here is an attempt:
library(stringdist)
library(stringi)
library(dplyr)
library(tidyr)
# Hypothetical data frame
data <- data.frame(V1 = paste0(stri_rand_strings(5, 3, "[a-z]"),
"@", stri_rand_strings(5, 2, "[a-z]"), ".com"),
stringsAsFactors = FALSE)
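Basically, you create a pairwise matrix of string distances and wrap it in a data frame. Every distance that is less than or equal to 3 (and not 0, which is a string compared with itself) is replaced by the corresponding V1 value, and everything else is replaced by NA. Then you drop the V1 column, which is no longer needed, gather() the data into tidy format, and remove the NA rows.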
data %>%
data.frame(stringdistmatrix(.$V1, .$V1, useNames = TRUE, method = "lcs"),
row.names = NULL) %>%
# V1 wnw.fa.com kty.hm.com brs.wk.com pib.uo.com ryu.iq.com
#1 wnw@fa.com 0 10 10 10 10
#2 kty@hm.com 10 0 10 10 8
#3 brs@wk.com 10 10 0 8 8
#4 pib@uo.com 10 10 8 0 10
#5 ryu@iq.com 10 8 8 10 0
# here you need to replace '8' by '3' for your example
mutate_each(funs(ifelse(. <= 8 & . != 0, V1, NA)), -V1) %>%
# V1 wnw.fa.com kty.hm.com brs.wk.com pib.uo.com ryu.iq.com
#1 wnw@fa.com NA <NA> <NA> <NA> <NA>
#2 kty@hm.com NA <NA> <NA> <NA> kty@hm.com
#3 brs@wk.com NA <NA> <NA> brs@wk.com brs@wk.com
#4 pib@uo.com NA <NA> pib@uo.com <NA> <NA>
#5 ryu@iq.com NA ryu@iq.com ryu@iq.com <NA> <NA>
select(-V1) %>%
gather(email1, email2) %>%
na.omit() %>%
mutate(email1 = stri_replace_first(email1, fixed = ".", "@"))
This gives:
# email1 email2
#1 kty@hm.com ryu@iq.com
#2 brs@wk.com pib@uo.com
#3 brs@wk.com ryu@iq.com
#4 pib@uo.com brs@wk.com
#5 ryu@iq.com kty@hm.com
#6 ryu@iq.com brs@wk.com
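Two caveats apply when scaling this up. For 300,000 emails the full pairwise matrix would have 300,000 x 300,000 = 9e10 entries, roughly 720 GB at 8 bytes each, so it cannot be built in one go. Also, mutate_each() and funs() are deprecated in current dplyr releases. The following is a minimal sketch of the same distance-matrix idea in base R plus stringdist, computing the matrix one block of rows at a time. The function name find_similar and the arguments max_dist and chunk_size are hypothetical names chosen for illustration, not part of any package. It also keeps each pair only once (i < j) rather than in both orders, and it reads the addresses from the original vector instead of from mangled column names, so addresses with a "." before the "@" are not altered.

```r
library(stringdist)

# Hypothetical helper: returns pairs of emails whose LCS distance is <= max_dist.
# Only chunk_size x length(emails) distances are held in memory at any time.
find_similar <- function(emails, max_dist = 3, chunk_size = 1000) {
  n <- length(emails)
  starts <- seq(1, n, by = chunk_size)
  out <- lapply(starts, function(s) {
    idx <- s:min(s + chunk_size - 1, n)
    # distances between this block of rows and every email
    d <- stringdistmatrix(emails[idx], emails, method = "lcs")
    hits <- which(d <= max_dist, arr.ind = TRUE)
    i <- idx[hits[, "row"]]   # position of the first email in the full vector
    j <- hits[, "col"]        # position of the second email
    keep <- i < j             # drops self-matches and mirrored pairs
    data.frame(email1 = emails[i[keep]],
               email2 = emails[j[keep]],
               stringsAsFactors = FALSE)
  })
  do.call(rbind, out)
}

emails <- c("01079@163.com", "01069@163.com",
            "cloud@coldmail.com", "freiheitmensch@gmail.com")
res <- find_similar(emails, max_dist = 3, chunk_size = 2)
print(res)
```

Chunking only bounds memory use. The number of distance computations is still quadratic in the number of emails, so for 300,000 rows this may remain slow unless the comparisons are restricted further, for example to emails sharing the same domain (an assumption about the data, not something stated in the question).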