How to make this nested for loop run faster

Question

My data looks like this:

txt$txt:

my friend stays in adarsh nagar
I changed one apple one samsung S3 n one sony experia z.
Hi girls..Friends meet at bangalore
what do u think of ccd at bkc

I have an exhaustive list of city names. A few of them are listed below:

city:

ahmedabad
adarsh nagar
airoli
bangalore
bangaladesh
banerghatta Road
bkc
calcutta

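I am searching txt$txt for city names (from the "city" list) and extracting them into another column when they are present. The simple loop below works for me, but it takes a long time on larger datasets.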

for(i in 1:nrow(txt)){
    a <- c()
    for(j in 1:nrow(city)){
        a[j] <- grepl(paste("\\b",city[j,1],"\\b", sep = ""),txt$txt[i])        
    }
    txt$city[i] <- ifelse(sum(a) > 0, paste(city[which(a),1], collapse = "_"), "NONE")  
}

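I tried using an apply function, and this is as far as I could get. It splits each text on spaces, so it misses multi-word names such as "adarsh nagar":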

apply(as.matrix(txt$txt), 1, function(x){ifelse(sum(unlist(strsplit(x, " ")) %in% city[,1]) > 0, paste(unlist(strsplit(x, " "))[which(unlist(strsplit(x, " ")) %in% city[,1])], collapse = "_"), "NONE")})
[1] "NONE"      "NONE"      "bangalore" "bkc"  

Desired Output:
> txt
                                                       txt         city
1                          my friend stays in adarsh nagar adarsh nagar
2 I changed one apple one samsung S3 n one sony experia z.         NONE
3                      Hi girls..Friends meet at bangalore    bangalore
4                            what do u think of ccd at bkc          bkc

I would like a faster approach in R that does the same thing as the for loop above. Please advise. Thanks.

Answer 1

Here is one possibility, using stri_extract_first_regex from the stringi package:

library(stringi)

# prepare some data
df <- data.frame(txt = c("in adarsh nagar", "sony experia z", "at bangalore"))
city <- c("ahmedabad", "adarsh nagar", "airoli", "bangalore")

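# one alternation pattern for all cities; the argument is named `pattern` in stri_extract_first_regex
df$city <- stri_extract_first_regex(str = df$txt, pattern = paste(city, collapse = "|"))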

df
#               txt         city
# 1 in adarsh nagar adarsh nagar
# 2  sony experia z         <NA>
# 3    at bangalore    bangalore
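
The call above returns only the first city per row, gives NA rather than "NONE", and matches without word boundaries. A minimal sketch of how the same idea might be extended to match the desired output, assuming stri_extract_all_regex from stringi and an added `\\b` around the alternation (the names `pattern` and `hits` and the extra "and airoli" in the sample row are illustrative, not from the original answer):

```r
library(stringi)

# sample data (first row extended with a second city for illustration)
df <- data.frame(txt = c("in adarsh nagar and airoli", "sony experia z", "at bangalore"),
                 stringsAsFactors = FALSE)
city <- c("ahmedabad", "adarsh nagar", "airoli", "bangalore")

# alternation of all cities, wrapped in word boundaries
pattern <- paste0("\\b(", paste(city, collapse = "|"), ")\\b")

# list with one character vector of matches per row (NA when nothing matches)
hits <- stri_extract_all_regex(df$txt, pattern)

# join unique matches with "_", or fall back to "NONE"
df$city <- vapply(hits, function(x) {
  if (all(is.na(x))) "NONE" else paste(unique(x), collapse = "_")
}, character(1))

df$city
```

If city names can contain regex metacharacters, they would need escaping before being pasted into the pattern.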

Answer 2

This should be much faster:

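# build one alternation pattern covering every city, each wrapped in word boundaries
bigPattern <- paste('(\\b', city[,1], '\\b)', collapse = '|', sep = '')
# extract all matches per row; join unique hits with '_' or fall back to 'NONE'
txt$city <- sapply(regmatches(txt$txt, gregexpr(bigPattern, txt$txt)), FUN = function(x) ifelse(length(x) == 0, 'NONE', paste(unique(x), collapse = '_')))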

Explanation:

In the first line, we build one big regular expression that matches all the cities, e.g.:

(\\bahmedabad\\b)|(\\badarsh nagar\\b)|(\\bairoli\\b)| ...

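Then we use gregexpr together with regmatches, which gives a list of matches for each element of txt$txt.

Finally, with a simple sapply, for each element of the list we concatenate the matched cities (after removing duplicates, i.e. cities mentioned more than once).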
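
The three steps can be tried end to end on the sample data from the question. This is only a sketch: the data frames are rebuilt by hand here, the column name `name` and the intermediate variable `hits` are assumptions for illustration, and vapply is swapped in for sapply to pin the return type.

```r
# sample data from the question, rebuilt as data frames
txt <- data.frame(txt = c("my friend stays in adarsh nagar",
                          "I changed one apple one samsung S3 n one sony experia z.",
                          "Hi girls..Friends meet at bangalore",
                          "what do u think of ccd at bkc"),
                  stringsAsFactors = FALSE)
city <- data.frame(name = c("ahmedabad", "adarsh nagar", "airoli", "bangalore",
                            "bangaladesh", "banerghatta Road", "bkc", "calcutta"),
                   stringsAsFactors = FALSE)

# step 1: one pattern of the form (\bcity1\b)|(\bcity2\b)|...
bigPattern <- paste('(\\b', city[, 1], '\\b)', collapse = '|', sep = '')

# step 2: all matches for each row, as a list of character vectors
hits <- regmatches(txt$txt, gregexpr(bigPattern, txt$txt))

# step 3: collapse unique matches per row, or "NONE" when there are none
txt$city <- vapply(hits, function(x) {
  if (length(x) == 0) "NONE" else paste(unique(x), collapse = "_")
}, character(1))

txt$city
# [1] "adarsh nagar" "NONE"         "bangalore"    "bkc"
```

The speed-up comes from scanning each text once against a single pattern instead of calling grepl once per city per row.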

Answer 3

Try this:

# YOUR DATA
##########
txt <- readLines(n = 4)
my friend stays in adarsh nagar and airoli
I changed one apple one samsung S3 n one sony experia z.
Hi girls..Friends meet at bangalore
what do u think of ccd at bkc

city <- readLines(n = 8)
ahmedabad
adarsh nagar
airoli
bangalore
bangaladesh
banerghatta Road
bkc
calcutta

# MATCHING
##########
matches <- unlist(setNames(lapply(city, grep, x = txt, fixed = TRUE), 
                           city))
(res <- (sapply(1:length(txt), function(x) 
  paste0(names(matches)[matches == x], collapse = "___"))))
# [1] "adarsh nagar___airoli" ""                      
# [3] "bangalore"             "bkc"
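
Two things may be worth checking before using this on real data: fixed = TRUE matches plain substrings (no word boundaries), and unlist appends a numeric suffix to the name when one city matches several rows (e.g. "bangalore1", "bangalore2"). A hedged variant of the same fixed-matching idea that sidesteps the naming issue by building a logical matrix; the name `m` and the inline vectors replacing the interactive readLines input are assumptions for illustration:

```r
# same data as above, written as plain character vectors
txt <- c("my friend stays in adarsh nagar and airoli",
         "I changed one apple one samsung S3 n one sony experia z.",
         "Hi girls..Friends meet at bangalore",
         "what do u think of ccd at bkc")
city <- c("ahmedabad", "adarsh nagar", "airoli", "bangalore",
          "bangaladesh", "banerghatta Road", "bkc", "calcutta")

# one column per city, one row per text; TRUE where the city occurs as a substring
m <- do.call(cbind, lapply(city, grepl, x = txt, fixed = TRUE))
colnames(m) <- city

# for each row, paste the names of the cities that matched
res <- apply(m, 1, function(r) paste(colnames(m)[r], collapse = "___"))
res
```

Rows without any match come back as empty strings, as in the answer above.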
