如何使此循環在R中運行得更快？

Question

有大約44000字的字典數據幀words.dict，和下面的代碼應該替換數據集中dataset.num所有單詞用於從詞典它們的數字標識。

data.num：

dput(head(dataset.num))
c("rt   breaking  will from here forward be know as", "i hope you like wine and cocktails", "this week we are upgrading our servers  there may be periodic disruptions to the housing application portal  sorry for any inconvenience", "hanging out in  foiachat  anyone have fav  management software on the gov t side  anything from intake to redaction   onwards", "they left out kourtney  instead they let chick from big bang talk", "i  am  encoding  film   like  for the  billionth time already ")

words.dict：

dput(head(words.dict,20)
structure(list(id = c(10L, 11L, 12L, 13L, 14L, 15L, 16L, 17L, 18L, 19L, 3L, 20L, 21L, 22L, 23L, 24L, 25L, 26L, 27L, 28L), word = structure(1:20, .Label =c("already", "am", "and", "any", "anyone", "anything", "application", "are", "as", "bang", "be", "big", "billionth", "breaking", "chick", "cocktails","disruptions", "encoding", "fav", "film", "foiachat", "for", "forward", "from", "gov", "hanging", "have", "here", "hope", "housing", "i", "in", "inconvenience", "instead", "intake", "know", "kourtney", "left", "let", "like", "management", "may", "on", "onwards", "our", "out", "periodic", "portal", "redaction", "rt", "servers", "side", "software", "sorry", "t", "talk", "the", "there", "they", "this", "time", "to", "upgrading", "we", "week", "will", "wine", "you"), class = "factor")), .Names = c("id", "word"), row.names = c(10L, 11L, 12L, 13L, 14L, 15L, 16L, 17L, 18L, 19L, 3L, 20L, 21L, 22L, 23L, 24L, 25L, 26L, 27L, 28L), class = "data.frame")

環：

for (i in 1:nrow(words.dict))

    dataset.num <-  gsub(paste0("\\b(", words.dict[i,"word"], ")\\b"),words.dict[i,1], dataset.num)

當我截斷數據時， dataset.num是幾乎四萬行的字符向量（每行平均包含20個單詞）。 該代碼適用於小數據，但不適用於處理速度有限的大數據。

您對提高代碼的效率和性能有何建議？

Answer 1

這是另一種方法，雖然我還沒有真正測試過它，但是它可能會更好地擴展。

sapply(strsplit(dataset.num, "\\s+"), function(y) {
  i <- match(y, words.dict$word)
  y[!is.na(i)] <- words.dict$id[na.omit(i)]
  paste(y, collapse = " ")
})
#[1] "rt 22 will from here forward 3 know 18"                                                                           
#[2] "i hope you like wine 12 24"                                                                                       
#[3] "this week we 17 upgrading our servers there may 3 periodic 25 to the housing 16 portal sorry for 13 inconvenience"
#[4] "hanging out in foiachat 14 have 27 management software on the gov t side 15 from intake to redaction onwards"     
#[5] "they left out kourtney instead they let 23 from 20 19 talk"                                                       
#[6] "i 11 26 28 like for the 21 time 10"

請注意，您可以使用stringi::stri_split加快字符串拆分速度。

如何使此循環在R中運行得更快？

問題描述

1 個解決方案

解決方案1
1 已采納 2016-04-21 09:16:52

如何使此循環在R中運行得更快？

問題描述

1 個解決方案

解決方案1 1 已采納 2016-04-21 09:16:52

解決方案1
1 已采納 2016-04-21 09:16:52