There is a dictionary data frame words.dict of approximately 44 thousand words, and the following code is supposed to replace every word in the dataset dataset.num with its numerical ID from the dictionary.
dataset.num:
dput(head(dataset.num))
c("rt breaking will from here forward be know as", "i hope you like wine and cocktails", "this week we are upgrading our servers there may be periodic disruptions to the housing application portal sorry for any inconvenience", "hanging out in foiachat anyone have fav management software on the gov t side anything from intake to redaction onwards", "they left out kourtney instead they let chick from big bang talk", "i am encoding film like for the billionth time already ")
words.dict:
dput(head(words.dict, 20))
structure(list(id = c(10L, 11L, 12L, 13L, 14L, 15L, 16L, 17L, 18L, 19L, 3L, 20L, 21L, 22L, 23L, 24L, 25L, 26L, 27L, 28L), word = structure(1:20, .Label =c("already", "am", "and", "any", "anyone", "anything", "application", "are", "as", "bang", "be", "big", "billionth", "breaking", "chick", "cocktails","disruptions", "encoding", "fav", "film", "foiachat", "for", "forward", "from", "gov", "hanging", "have", "here", "hope", "housing", "i", "in", "inconvenience", "instead", "intake", "know", "kourtney", "left", "let", "like", "management", "may", "on", "onwards", "our", "out", "periodic", "portal", "redaction", "rt", "servers", "side", "software", "sorry", "t", "talk", "the", "there", "they", "this", "time", "to", "upgrading", "we", "week", "will", "wine", "you"), class = "factor")), .Names = c("id", "word"), row.names = c(10L, 11L, 12L, 13L, 14L, 15L, 16L, 17L, 18L, 19L, 3L, 20L, 21L, 22L, 23L, 24L, 25L, 26L, 27L, 28L), class = "data.frame")
Loop:
for (i in 1:nrow(words.dict))
  dataset.num <- gsub(paste0("\\b(", words.dict[i, "word"], ")\\b"),
                      words.dict[i, 1], dataset.num)
The data above is truncated; the real dataset.num is a character vector of almost 40 thousand lines (each line contains 20 words on average). The code works well on small data, but it is slow on the full dataset.
What would you suggest to improve the efficiency and performance of the code?
Here's a different approach that may scale better, though I haven't tested it thoroughly. It splits each line into words once, looks every word up in the dictionary with match, and then pastes the line back together:
sapply(strsplit(dataset.num, "\\s+"), function(y) {
  i <- match(y, words.dict$word)
  y[!is.na(i)] <- words.dict$id[na.omit(i)]
  paste(y, collapse = " ")
})
#[1] "rt 22 will from here forward 3 know 18"
#[2] "i hope you like wine 12 24"
#[3] "this week we 17 upgrading our servers there may 3 periodic 25 to the housing 16 portal sorry for 13 inconvenience"
#[4] "hanging out in foiachat 14 have 27 management software on the gov t side 15 from intake to redaction onwards"
#[5] "they left out kourtney instead they let 23 from 20 19 talk"
#[6] "i 11 26 28 like for the 21 time 10"
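A further variant replaces the per-line match with a named lookup vector built once up front, so the dictionary column is never re-scanned. This is an untested sketch on toy data standing in for words.dict and dataset.num (only the first few dictionary entries are used here for illustration):

```r
# Toy stand-ins for words.dict / dataset.num, taken from the truncated data above.
words.dict  <- data.frame(id = c(22L, 3L, 18L),
                          word = c("breaking", "be", "as"),
                          stringsAsFactors = FALSE)
dataset.num <- c("rt breaking will from here forward be know as")

# Build the lookup table once: names are words, values are their IDs as strings.
lookup <- setNames(as.character(words.dict$id), words.dict$word)

dataset.num <- vapply(strsplit(dataset.num, "\\s+"), function(y) {
  ids <- lookup[y]                   # NA where the word is not in the dictionary
  y[!is.na(ids)] <- ids[!is.na(ids)]
  paste(y, collapse = " ")
}, character(1))

dataset.num
# [1] "rt 22 will from here forward 3 know 18"
```

vapply is used instead of sapply only to fix the return type; the lookup itself is a plain named-vector subset, which stays vectorised within each line.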
Note that you could use stringi::stri_split to speed up the string splitting.
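For example, stri_split_regex from the stringi package is a drop-in replacement for the strsplit call above (a sketch, assuming stringi is installed):

```r
library(stringi)

# Equivalent of strsplit(dataset.num, "\\s+"), but typically faster on large vectors.
dataset.num <- c("rt breaking will", "i hope you")  # toy stand-in for the real data
tokens <- stri_split_regex(dataset.num, "\\s+")

tokens[[1]]
# [1] "rt"       "breaking" "will"
```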