
How do I make this nested for loop work faster?

My data looks like this:

txt$txt:

my friend stays in adarsh nagar
I changed one apple one samsung S3 n one sony experia z.
Hi girls..Friends meet at bangalore
what do u think of ccd at bkc

I have an exhaustive list of city names. Some of them are listed below:

city:

ahmedabad
adarsh nagar
airoli
bangalore
bangaladesh
banerghatta Road
bkc
calcutta

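I am searching txt$txt for city names (from the "city" list) and extracting any that are present into another column. The simple loop below works for me, but it takes a long time on larger datasets.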

for(i in 1:nrow(txt)){
    a <- c()
    for(j in 1:nrow(city)){
        a[j] <- grepl(paste("\\b",city[j,1],"\\b", sep = ""),txt$txt[i])        
    }
    txt$city[i] <- ifelse(sum(a) > 0, paste(city[which(a),1], collapse = "_"), "NONE")  
}   

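I tried using an apply function, and this is as far as I could get. It splits each text on spaces, so multi-word cities such as "adarsh nagar" are missed: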

apply(as.matrix(txt$txt), 1, function(x){ifelse(sum(unlist(strsplit(x, " ")) %in% city[,1]) > 0, paste(unlist(strsplit(x, " "))[which(unlist(strsplit(x, " ")) %in% city[,1])], collapse = "_"), "NONE")})
[1] "NONE"      "NONE"      "bangalore" "bkc"  

Desired Output:
> txt
                                                       txt         city
1                          my friend stays in adarsh nagar adarsh nagar
2 I changed one apple one samsung S3 n one sony experia z.         NONE
3                      Hi girls..Friends meet at bangalore    bangalore
4                            what do u think of ccd at bkc          bkc    

I want a faster approach in R that does the same thing as the for loop above. Please advise. Thanks.

Here is one possibility, using stri_extract_first_regex from the stringi package:

library(stringi)

# prepare some data
df <- data.frame(txt = c("in adarsh nagar", "sony experia z", "at bangalore"))
city <- c("ahmedabad", "adarsh nagar", "airoli", "bangalore")

df$city <- stri_extract_first_regex(str = df$txt, pattern = paste(city, collapse = "|"))

df
#               txt         city
# 1 in adarsh nagar adarsh nagar
# 2  sony experia z         <NA>
# 3    at bangalore    bangalore
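
Note that the call above returns only the first city per row, matches substrings (the pattern has no word boundaries), and gives NA rather than "NONE". The following is a sketch of one way to get closer to the loop's output, assuming stringi is installed. The sample rows and the names pattern and found are made up for illustration and are not part of the original answer.

```r
library(stringi)

# Illustrative data; the first row mentions two cities on purpose
df <- data.frame(txt = c("in adarsh nagar and airoli", "sony experia z", "at bangalore"),
                 stringsAsFactors = FALSE)
city <- c("ahmedabad", "adarsh nagar", "airoli", "bangalore")

# One alternation wrapped in word boundaries, so a city name is not
# matched inside a longer word
pattern <- paste0("\\b(?:", paste(city, collapse = "|"), ")\\b")

# omit_no_match = TRUE yields character(0) instead of NA for rows with no city
found <- stri_extract_all_regex(df$txt, pattern, omit_no_match = TRUE)

# Join the distinct matches per row, falling back to "NONE"
df$city <- vapply(found, function(x) {
  if (length(x) > 0) paste(unique(x), collapse = "_") else "NONE"
}, character(1))

df$city
# [1] "adarsh nagar_airoli" "NONE"                "bangalore"
```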

This should be much faster:

bigPattern <- paste('(\\b',city[,1],'\\b)',collapse='|',sep='')
txt$city <- sapply(regmatches(txt$txt,gregexpr(bigPattern,txt$txt)),FUN=function(x) ifelse(length(x) == 0,'NONE',paste(unique(x),collapse='_')))

Explanation:

In the first line, we build one big regular expression that matches all the cities, e.g.:

(\\bahmedabad\\b)|(\\badarsh nagar\\b)|(\\bairoli\\b)| ...

Then we use gregexpr in combination with regmatches, which gives us a list of matches for each element of txt$txt.

Finally, with a simple sapply, for each element of the list we concatenate the matched cities (after removing duplicates, i.e. cities mentioned more than once).
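
The three steps can be tried end to end on the sample data from the question. This is a minimal sketch rather than the answerer's exact code: the column name name, the use of vapply in place of sapply, and the helper escape_regex are assumptions added here. The helper is hypothetical and only matters if a city name contains regex metacharacters such as "."; for the sample cities it changes nothing.

```r
# Sample data from the question
txt <- data.frame(txt = c("my friend stays in adarsh nagar",
                          "I changed one apple one samsung S3 n one sony experia z.",
                          "Hi girls..Friends meet at bangalore",
                          "what do u think of ccd at bkc"),
                  stringsAsFactors = FALSE)
city <- data.frame(name = c("ahmedabad", "adarsh nagar", "airoli", "bangalore",
                            "bangaladesh", "banerghatta Road", "bkc", "calcutta"),
                   stringsAsFactors = FALSE)

# Hypothetical helper: backslash-escape regex metacharacters in a city name
escape_regex <- function(x) {
  gsub("([.\\\\|()\\[\\]{}^$*+?])", "\\\\\\1", x, perl = TRUE)
}

# Step 1: one alternation, each city wrapped in word boundaries
bigPattern <- paste0("(\\b", escape_regex(city[, 1]), "\\b)", collapse = "|")

# Step 2: all matches for every text, as a list with one element per row
hits <- regmatches(txt$txt, gregexpr(bigPattern, txt$txt))

# Step 3: join the distinct matches per row, or "NONE" when there are none
txt$city <- vapply(hits, function(x) {
  if (length(x) == 0) "NONE" else paste(unique(x), collapse = "_")
}, character(1))

txt$city
# [1] "adarsh nagar" "NONE"         "bangalore"    "bkc"
```

The gain comes from calling the regex engine once per text instead of once per text and city.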

Try this:

# YOUR DATA
##########
txt <- readLines(n = 4)
my friend stays in adarsh nagar and airoli
I changed one apple one samsung S3 n one sony experia z.
Hi girls..Friends meet at bangalore
what do u think of ccd at bkc

city <- readLines(n = 8)
ahmedabad
adarsh nagar
airoli
bangalore
bangaladesh
banerghatta Road
bkc
calcutta

# MATCHING
##########
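# For each city, the indices of the txt elements containing it (literal match)
hits <- lapply(city, grep, x = txt, fixed = TRUE)
# Flatten and label every index with its city; rep() keeps the names intact
# even when one city occurs in several texts (unlist() alone would rename
# them to e.g. "bkc1", "bkc2")
matches <- setNames(unlist(hits), rep(city, lengths(hits)))
(res <- sapply(seq_along(txt), function(x) 
  paste0(names(matches)[matches == x], collapse = "___")))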
# [1] "adarsh nagar___airoli" ""                      
# [3] "bangalore"             "bkc" 
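
The readLines(n = ...) calls above expect the lines to be pasted into an interactive console. Below is a non-interactive sketch of the same fixed-string idea, offered as one possible variant rather than the answerer's code: it builds a logical text-by-city matrix (the names hit and res are illustrative) and fills in "NONE" where nothing matched, as in the desired output. Because fixed = TRUE matches literal substrings, "bkc" would also be found inside a longer word; the regex answers with \\b avoid that.

```r
# Same sample data as above, written as plain character vectors
txt <- c("my friend stays in adarsh nagar and airoli",
         "I changed one apple one samsung S3 n one sony experia z.",
         "Hi girls..Friends meet at bangalore",
         "what do u think of ccd at bkc")
city <- c("ahmedabad", "adarsh nagar", "airoli", "bangalore",
          "bangaladesh", "banerghatta Road", "bkc", "calcutta")

# One grepl() call per city, each vectorised over all texts
hit <- vapply(city, function(ci) grepl(ci, txt, fixed = TRUE),
              logical(length(txt)))
# Force a texts-by-cities matrix, so a single text still gives one row
hit <- matrix(hit, nrow = length(txt))

# Per text: join the cities that were found, or "NONE" if there were none
res <- apply(hit, 1, function(found) {
  if (any(found)) paste(city[found], collapse = "_") else "NONE"
})

res
# [1] "adarsh nagar_airoli" "NONE"                "bangalore"
# [4] "bkc"
```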

