Faster alternative to for-loops for pattern matching in R

Question

I am working on a problem in which I have two data frames, data and abbreviations, and I would like to replace all the abbreviations present in data with their respective full forms. Until now I have been using for-loops in the following manner:

abb <- c()
for (i in seq_along(data$text)) {
  for (j in seq_along(Abbreviation$abb)) {
    # Build a word-bounded pattern for the j-th abbreviation
    abb <- paste("(\\b", Abbreviation$abb[j], "\\b)", sep = "")
    data$text[i] <- gsub(abb, Abbreviation$Fullform[j], tolower(data$text[i]))
  }
}

The abbreviation data frame has two columns, abb and Fullform, and can be generated using the following code:


Abbreviation <- c(c("hru", "how are you"), 
                  c("asap", "as soon as possible"), 
                  c("bf", "boyfriend"), 
                  c("ur", "your"), 
                  c("u", "you"),
                  c("afk", "away from keyboard"))
Abbreviation <- data.frame(matrix(Abbreviation, ncol = 2, byrow = TRUE),
                           row.names = NULL, stringsAsFactors = FALSE)
names(Abbreviation) <- c("abb", "Fullform")

The data is simply a data frame with one column holding a text string in each row, which can also be generated using the following code:


data <- data.frame(c("its good to see you, hru doing?",
                     "I am near bridge come ASAP",
                     "Can u tell me the method u used for",
                     "afk so couldn't respond to ur mails",
                     "asmof I dont know who is your bf?"),
                   stringsAsFactors = FALSE)
names(data) <- "text"

Initially, I had a data frame with around 1,000 observations and around 100 abbreviations, so I was able to run the analysis. But now the data has grown to almost 50,000 observations, and processing it is very slow because of the two nested for-loops. Can you suggest better alternatives to the for-loop and explain with an example how to use them in this situation? If this problem can be solved faster through vectorization, please suggest how to do that as well.

Thanks for the help!

Answer 1

First of all, there is clearly no need to rebuild the regular expressions on each iteration of the loop. Also, there is no need to actually loop over data$text: in R, you can very often use a vector where a single value would do, and R will go through all the elements of the vector and return a vector of the same length.

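# Build each word-bounded pattern once, outside the loop
Abbreviation$regex <- sprintf("(\\b%s\\b)", Abbreviation$abb)

# gsub is vectorized over data$text, so only the abbreviations are looped over
for (j in seq_along(Abbreviation$abb)) {
  data$text <- gsub(Abbreviation$regex[j],
                    Abbreviation$Fullform[j], data$text,
                    ignore.case = TRUE)
}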

The above code works with the example data.
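For reference, the same idea can be wrapped in a small function so the original data frame is not overwritten. This is only a sketch: the function name expand_abbreviations and the tiny sample vectors are made up for illustration, and the patterns assume the abbreviations contain no regex metacharacters.

```r
# Hypothetical helper (name is illustrative, not from the question):
# apply every abbreviation to the whole character vector at once.
expand_abbreviations <- function(text, abb, fullform) {
  patterns <- sprintf("\\b%s\\b", abb)  # build each regex once
  for (j in seq_along(patterns)) {
    # gsub processes all elements of `text` in a single call
    text <- gsub(patterns[j], fullform[j], text, ignore.case = TRUE)
  }
  text
}

# Small made-up sample in the shape of the question's data
abbreviations <- data.frame(
  abb = c("hru", "asap", "u"),
  Fullform = c("how are you", "as soon as possible", "you"),
  stringsAsFactors = FALSE
)
messages <- c("hru doing?", "come ASAP", "Can u tell me")

expanded <- expand_abbreviations(messages,
                                 abbreviations$abb,
                                 abbreviations$Fullform)
print(expanded)
```

With this shape the loop runs once per abbreviation (about 100 iterations in the question) rather than once per abbreviation per row.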

Answer 2

This should be faster, and without side effects.

mapply(function(x, y) {
  abb <- paste0("(\\b", x, "\\b)")
  gsub(abb, y, tolower(data$text))
}, Abbreviation$abb, Abbreviation$Fullform)

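gsub is vectorized, so you can give it a character vector in which matches are sought. Here I give it data$text.
I use mapply to avoid the side effect of for. Note that mapply returns a matrix with one column per abbreviation, and each column has only that single replacement applied, so the columns still have to be combined to get fully expanded text.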
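To get a single vector with every abbreviation expanded, the substitutions have to be chained so that each one works on the result of the previous one. A minimal sketch using base R's Reduce, which still avoids modifying data in place; the name expand_with_reduce and the sample vectors are assumptions made for illustration, not part of the original answer.

```r
# Hypothetical helper: fold the substitutions over the text vector,
# feeding the output of one gsub call into the next.
expand_with_reduce <- function(text, abb, fullform) {
  Reduce(
    function(current, j) {
      gsub(sprintf("\\b%s\\b", abb[j]), fullform[j], current,
           ignore.case = TRUE)
    },
    seq_along(abb),  # indices to fold over
    text             # initial value of the accumulator
  )
}

abb      <- c("hru", "asap", "u")
fullform <- c("how are you", "as soon as possible", "you")
messages <- c("hru doing?", "come ASAP", "Can u tell me")

# Chained result: one character vector, all abbreviations expanded
expanded <- expand_with_reduce(messages, abb, fullform)

# For comparison, mapply gives one column per abbreviation
per_abbreviation <- mapply(
  function(x, y) gsub(sprintf("\\b%s\\b", x), y, messages, ignore.case = TRUE),
  abb, fullform
)

print(expanded)
print(dim(per_abbreviation))
```

Reduce performs the same number of gsub calls as the loop over abbreviations, so it is a stylistic alternative rather than a guaranteed speed-up.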

Answer 1
1 2013-07-17 08:41:28

Answer 2
1 2013-07-17 09:31:54
