Faster alternative methods to for-loop in R for pattern matching

I am working on a problem in which I have two data frames, data and a table of abbreviations, and I would like to replace every abbreviation present in data with its full form. Until now I have been using for-loops in the following manner:

for (i in seq_along(data$text)) {
  for (j in seq_along(Abbreviation$abb)) {
    # Build a whole-word pattern for the j-th abbreviation
    abb <- paste("(\\b", Abbreviation$abb[j], "\\b)", sep="")
    data$text[i] <- gsub(abb, Abbreviation$Fullform[j], tolower(data$text[i]))
  }
}

The abbreviation data frame can be generated using the following code:

Abbreviation <- c(c("hru", "how are you"), 
                  c("asap", "as soon as possible"), 
                  c("bf", "boyfriend"), 
                  c("ur", "your"), 
                  c("u", "you"),
                  c("afk", "away from keyboard"))
Abbreviation <- data.frame(matrix(Abbreviation, ncol=2, byrow=TRUE), row.names=NULL,
                           stringsAsFactors=FALSE)

names(Abbreviation) <- c("abb","Fullform")

The data is simply a data frame with one column containing a text string in each row, which can also be generated using the following code:

data <- data.frame(unlist(c("its good to see you, hru doing?", 
                            "I am near bridge come ASAP",
                            "Can u tell me the method u used for",
                            "afk so couldn't respond to ur mails",
                            "asmof I dont know who is your bf?")))
names(data) <- "text"

Initially, I had a data frame with around 1000 observations and around 100 abbreviations, so I was able to run the analysis. But now the data has grown to almost 50000 rows and I am having difficulty processing it, because the two nested for-loops make the process very slow. Can you suggest some better alternatives to the for-loop and explain with an example how to use them in this situation? If this problem can be solved faster via vectorization, please suggest how to do that as well.

Thanks for the help!

First of all, there is clearly no need to rebuild the regular expressions on every iteration of the loop. Also, there is no need to loop over data$text at all: in R, you can very often pass a vector where a single value would do, and R will go through all the elements of the vector and return a vector of the same length.

Abbreviation$regex <- sprintf( "(\\b%s\\b)", Abbreviation$abb )

for( j in seq_along( Abbreviation$abb ) ) {
    # gsub processes the whole data$text vector in one call
    data$text <- gsub( Abbreviation$regex[j],
                       Abbreviation$Fullform[j], data$text,
                       ignore.case = TRUE )
}

The above code works with the example data.
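If the loop over abbreviations itself becomes the bottleneck, one alternative is a single-pass table lookup. This is a different technique from the loop above: split each string into word tokens, look every token up in the abbreviation table, and write the replacements back with base R's `regmatches<-`. The sketch below assumes that each abbreviation is a single run of letters or digits. `expand_by_lookup` is a hypothetical helper name, the sample vectors mirror the question's data, and no timing is claimed for it.

```r
# Hypothetical helper (not from the original post): expand abbreviations
# by looking up each word token in a table instead of running one gsub
# per abbreviation. Assumes abbreviations are single alphanumeric tokens.
expand_by_lookup <- function(text, abb, full) {
  text <- as.character(text)
  abb  <- tolower(as.character(abb))
  full <- as.character(full)
  # Locate every run of letters/digits in each string
  m <- gregexpr("[[:alnum:]]+", text)
  words <- regmatches(text, m)
  # Replace tokens found in the table, keep all other tokens unchanged
  regmatches(text, m) <- lapply(words, function(w) {
    idx <- match(tolower(w), abb)
    ifelse(is.na(idx), w, full[idx])
  })
  text
}

# Sample data mirroring the question
abb  <- c("hru", "asap", "bf", "ur", "u", "afk")
full <- c("how are you", "as soon as possible", "boyfriend",
          "your", "you", "away from keyboard")
text <- c("its good to see you, hru doing?",
          "I am near bridge come ASAP",
          "Can u tell me the method u used for",
          "afk so couldn't respond to ur mails",
          "asmof I dont know who is your bf?")

expanded <- expand_by_lookup(text, abb, full)
print(expanded)
```

The trade-off is that token lookup cannot handle abbreviations that span several words or contain punctuation; for those, the regex loop remains the simpler choice.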

This should be faster, and without side effects:

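# Each call expands one abbreviation across the whole text vector;
# the result is a character matrix with one column per abbreviation.
mapply(function(x,y){
  abb <- paste0("(\\b", x, "\\b)")
  gsub(abb, y, tolower(data$text))
},Abbreviation$abb,Abbreviation$Fullform)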
  1. gsub is vectorized, so you can give it a whole character vector in which matches are sought. Here I give it data$text.
  2. I use mapply to avoid the side effect of for.
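Note that mapply applies each abbreviation to the original text independently, so the replacements are not combined into a single result. One way to chain them without an explicit for loop is base R's `Reduce`, which feeds the output of one `gsub` call into the next. The following is a minimal sketch under that assumption; the names `abbreviations`, `text` and `expanded` are illustrative, and the sample data mirrors the question.

```r
# Sample data mirroring the question (names are illustrative)
abbreviations <- data.frame(
  abb      = c("hru", "asap", "bf", "ur", "u", "afk"),
  Fullform = c("how are you", "as soon as possible", "boyfriend",
               "your", "you", "away from keyboard"),
  stringsAsFactors = FALSE
)
text <- c("its good to see you, hru doing?",
          "I am near bridge come ASAP",
          "Can u tell me the method u used for",
          "afk so couldn't respond to ur mails",
          "asmof I dont know who is your bf?")

# Reduce starts from the full text vector and applies one vectorized gsub
# per abbreviation, passing each result on to the next replacement.
expanded <- Reduce(
  function(txt, j) {
    pattern <- paste0("\\b", abbreviations$abb[j], "\\b")
    gsub(pattern, abbreviations$Fullform[j], txt, ignore.case = TRUE)
  },
  seq_len(nrow(abbreviations)),
  text
)
print(expanded)
```

This still performs one `gsub` pass per abbreviation, so it is mainly a side-effect-free way to write the single loop rather than a different algorithm.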
