How can I make this loop run faster in R?

Question

There is a dictionary data frame, words.dict, of approximately 44 thousand words. The following code is supposed to replace every word in the dataset dataset.num with its numeric ID from the dictionary.

dataset.num:

dput(head(dataset.num))
c("rt   breaking  will from here forward be know as", "i hope you like wine and cocktails", "this week we are upgrading our servers  there may be periodic disruptions to the housing application portal  sorry for any inconvenience", "hanging out in  foiachat  anyone have fav  management software on the gov t side  anything from intake to redaction   onwards", "they left out kourtney  instead they let chick from big bang talk", "i  am  encoding  film   like  for the  billionth time already ")

words.dict:

dput(head(words.dict, 20))
structure(list(id = c(10L, 11L, 12L, 13L, 14L, 15L, 16L, 17L, 18L, 19L, 3L, 20L, 21L, 22L, 23L, 24L, 25L, 26L, 27L, 28L), word = structure(1:20, .Label =c("already", "am", "and", "any", "anyone", "anything", "application", "are", "as", "bang", "be", "big", "billionth", "breaking", "chick", "cocktails","disruptions", "encoding", "fav", "film", "foiachat", "for", "forward", "from", "gov", "hanging", "have", "here", "hope", "housing", "i", "in", "inconvenience", "instead", "intake", "know", "kourtney", "left", "let", "like", "management", "may", "on", "onwards", "our", "out", "periodic", "portal", "redaction", "rt", "servers", "side", "software", "sorry", "t", "talk", "the", "there", "they", "this", "time", "to", "upgrading", "we", "week", "will", "wine", "you"), class = "factor")), .Names = c("id", "word"), row.names = c(10L, 11L, 12L, 13L, 14L, 15L, 16L, 17L, 18L, 19L, 3L, 20L, 21L, 22L, 23L, 24L, 25L, 26L, 27L, 28L), class = "data.frame")

Loop:

# Replace each dictionary word (matched as a whole word) with its numeric ID
for (i in seq_len(nrow(words.dict)))
    dataset.num <- gsub(paste0("\\b(", words.dict[i, "word"], ")\\b"), words.dict[i, "id"], dataset.num)

The data shown here is truncated; the full dataset.num is a character vector of almost 40 thousand lines (each line contains 20 words on average). The code works well on small data, but it is slow on large data when processing power is limited.

What would you suggest to improve the efficiency and performance of the code?

Answer 1

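Here's a different approach, which perhaps scales better, though I haven't really tested it. It splits each line into words once and looks all of them up in the dictionary with match(), instead of running one gsub() pass over the whole dataset per dictionary word.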

sapply(strsplit(dataset.num, "\\s+"), function(y) {
  i <- match(y, words.dict$word)
  y[!is.na(i)] <- words.dict$id[na.omit(i)]
  paste(y, collapse = " ")
})
#[1] "rt 22 will from here forward 3 know 18"                                                                           
#[2] "i hope you like wine 12 24"                                                                                       
#[3] "this week we 17 upgrading our servers there may 3 periodic 25 to the housing 16 portal sorry for 13 inconvenience"
#[4] "hanging out in foiachat 14 have 27 management software on the gov t side 15 from intake to redaction onwards"     
#[5] "they left out kourtney instead they let 23 from 20 19 talk"                                                       
#[6] "i 11 26 28 like for the 21 time 10"
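If the per-line anonymous function itself becomes the bottleneck, one possible variation is to flatten all tokens into a single vector and call match() only once. This is an untested-at-scale sketch, not part of the original answer; the toy words.dict and dataset.num below are made-up stand-ins for the real data, and the helper names (tokens, flat, idx, hit, grp, result) are purely illustrative.

```r
# Toy stand-ins for the real data (assumed shapes: id = integer, word = character)
words.dict <- data.frame(
  id   = c(10L, 11L, 12L, 3L),
  word = c("already", "am", "and", "be"),
  stringsAsFactors = FALSE
)
dataset.num <- c("i am already here", "wine and cocktails", "to be or not to be")

# Split every line into words, then flatten into one long character vector
tokens <- strsplit(dataset.num, "\\s+")
flat   <- unlist(tokens)

# A single dictionary lookup for all tokens at once
idx <- match(flat, words.dict$word)
hit <- !is.na(idx)
flat[hit] <- words.dict$id[idx[hit]]   # IDs are coerced to character here

# Remember which line each token came from, then glue the lines back together
grp    <- factor(rep(seq_along(tokens), lengths(tokens)),
                 levels = seq_along(tokens))
result <- unname(vapply(split(flat, grp), paste, character(1), collapse = " "))

print(result)
```

Passing explicit levels to factor() keeps empty input lines in the result as empty strings rather than dropping them.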

Note that you could use stringi::stri_split to speed up the string splitting.
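A minimal sketch of that substitution, assuming the stringi package is installed. It uses stri_split_regex() with omit_empty = TRUE, which drops the empty tokens that leading, trailing, or repeated whitespace would otherwise produce, so the spacing of the output can differ slightly from the strsplit() version. The toy data is made up for illustration, and no speedup figure is claimed here.

```r
library(stringi)

# Toy stand-ins for the real data (assumed shapes: id = integer, word = character)
words.dict <- data.frame(
  id   = c(10L, 11L, 12L, 3L),
  word = c("already", "am", "and", "be"),
  stringsAsFactors = FALSE
)
dataset.num <- c("i  am  already here ", "wine and cocktails")

# Split on runs of whitespace; omit_empty = TRUE discards empty tokens
tokens <- stri_split_regex(dataset.num, "\\s+", omit_empty = TRUE)

# Same per-line lookup as in the answer, with vapply() for a fixed return type
result <- vapply(tokens, function(y) {
  i <- match(y, words.dict$word)
  y[!is.na(i)] <- words.dict$id[i[!is.na(i)]]
  paste(y, collapse = " ")
}, character(1))

print(result)
```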

Answer 1: accepted, 2016-04-21 09:16:52
