How do I match strings exactly in R?

Question

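I have a data frame full of correctly and incorrectly spelled words, and a separate list of words collected from users. I need to check each word and find its correctly spelled version in the data frame.

The code below does almost what I want, but because of the kind of data I'm working with it matches far too loosely; I need it to match whole words exactly. Does anyone know how I can do this?

TDM.frame is a term-document matrix generated from user input, which is a csv of thousands of entries.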

  spellDB <- read.csv("spellcheck.csv")
  words <- row.names(TDM.frame)
  k <- 0
  wordLoc <- NULL
  badWord <- NULL
  goodWord <-NULL
  for (i in 1:nrow(TDM.frame)){
    if(length(grep(words[i],spellDB$Incorrect))>0){
      k <- k + 1 
      wordLoc[k] <- grep(words[i],spellDB$Incorrect,fixed = TRUE)
      badWord[k] <- words[i]
      goodWord[k] <- as.character(spellDB$Correct[wordLoc[k]])
      corrections <- cbind(goodWord,badWord)
    }
  }

This outputs the following:

> corrections
       goodWord             badWord         
  [1,] "account"            "accounts"      
  [2,] "account"            "accout"        
  [3,] "activate"           "act"           
  [4,] "faction"            "action"        
  [5,] "activate"           "activate"      
  [6,] "activate"           "activated"

Rows 1, 2 and 6 are correct because those words are in spellDB, but rows 3 and 4 are not in it and so should not have been matched.
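To make the false positives concrete, here is a minimal sketch of what happens when a word is searched for as a substring. The `incorrect` vector is a made-up stand-in (the real "activate" row of spellDB is not shown in the post), so treat the exact strings as assumptions.

```r
# Hypothetical stand-in for spellDB$Incorrect; not the real data
incorrect <- c("aacount accout accounts", "activat activated")

# grep() looks for the word anywhere inside each string, so "act"
# hits row 2 even though "act" is not one of the words listed there
grep("act", incorrect, fixed = TRUE)      # 2

# The same effect explains "action" being found inside "faction"
grep("action", "faction factions", fixed = TRUE)   # 1

# Comparing against the individual words instead gives exact matching
row2.words <- strsplit(incorrect[2], " ", fixed = TRUE)[[1]]
"act" %in% row2.words         # FALSE
"activated" %in% row2.words   # TRUE
```

Note that `fixed = TRUE` only switches off regular-expression syntax; it still matches substrings, so on its own it does not make the match exact.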

I have also tried this regular expression (and others), but it doesn't work at all; the only result I get is integer(0):

grep(paste0("?=.\\b",words[12],"\\b"),spellDB$Incorrect)
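For comparison, a whole-word pattern can be sketched as follows. A lookahead in R is written `(?=...)` and is only available with `perl = TRUE`, so the pattern above is probably not doing what was intended; plain `\\b` word boundaries are enough here. The helper name `has_word` and the sample vector are illustrative assumptions, and the sketch assumes the words contain no regex metacharacters.

```r
# Sample rows copied from the spellDB excerpt in the question
incorrect <- c("aacount accound accoun accountc acount accout accounts",
               "adventur adventures")

# Hypothetical helper: TRUE for each element of x that contains `word`
# as a complete word (bounded on both sides by \b)
has_word <- function(word, x) {
  grepl(paste0("\\b", word, "\\b"), x, perl = TRUE)
}

has_word("accout", incorrect)     # TRUE FALSE
has_word("acc", incorrect)        # FALSE FALSE (only a prefix of other words)
has_word("adventur", incorrect)   # FALSE TRUE
```

`which(has_word(words[i], spellDB$Incorrect))` would then give the row index that `grep()` was being used for.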

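My goal is to correct the spelling in the data so that the term-document matrix is right and the word counts are accurate. If there is a better way to do this, that would be great. This probably looks like a mess, but I'm new to R and couldn't find another option.

Thanks for reading!

EDIT: The word list I'm checking has 1143 items, but its head reads: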

> head(words)

[1] "absolute"          "absolutely"        "acceptedcompleted" "accidently"        "accounts"          "accout"

The contents of spellDB look like this:

    Correct                                                Incorrect
1   ability                       abilities                         
2   account  aacount accound accoun accountc acount accout accounts 
3 adventure                                      adventur adventures
4    amazon                               amazoncom amazonid amazons
5   android                                                   andoid
6     apple                                                  appleid

EDIT2：

Sadly I can't post the full dput because some of it is sensitive data, but I've removed the offending words since they're irrelevant anyway...

> dput(head(spellDB))
structure(list(Correct = structure(c(1L, 2L, 5L, 8L, 9L, 10L), .Label = c("ability", 
"account", "achievment", "activate", "adventure"), class = "factor"), 
Incorrect = structure(c(4L, 3L, 119L, 120L, 121L, 122L), .Label = c("", 
"", " aacount accound accoun accountc acount accout accounts ", 
" abilities ", " acheiv acheivements acheivment achi achiement achiev achievcement achieve achieved achieveent achievements achievemetn achievments achievmnet achiv achive achived achivement achivements achivment achivmenti achivments achv achviement avhivemnt", 
"andoid", "appleid", "cbind"), class = "factor")), .Names = c("Correct", 
"Incorrect"), row.names = c(NA, 6L), class = "data.frame")

Answer 1

Your dput data didn't work for me, so I recreated it:

spellDB <- read.table(text="    Correct,                                                Incorrect
   ability,                       abilities                         
   account,  aacount accound accoun accountc acount accout accounts 
 adventure,                                      adventur adventures
    amazon,                               amazoncom amazonid amazons
   android ,                                                  andoid
     apple,                                                  appleid", sep=",", as.is=TRUE, header=TRUE)

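# Collapse runs of spaces, then trim leading and trailing whitespace in both
# columns (column 1 needs the trailing trim too, because of "android ")
spellDB[,1] <- gsub(" +", " ", spellDB[,1])
spellDB[,1] <- gsub("^\\s|\\s$", "", spellDB[,1])
spellDB[,2] <- gsub(" +", " ", spellDB[,2])
spellDB[,2] <- gsub("^\\s|\\s$", "", spellDB[,2])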

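This solution works, but I'm not sure how efficient it will be for your number of rows. It works by checking whether each word appears in one big list of all the incorrect spellings; if it does, it looks up the corresponding correct word and adds it to a new vector.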

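# Split each row's misspellings into separate words, then flatten into one vector
incorrects.list <- strsplit(spellDB$Incorrect, " ")
incorrects.unlist <- unlist(incorrects.list)
words <- c("absolute","absolutely","acceptedcompleted","accidently", "accounts","accout")
newwords <- rep(NA, length(words))

for (w in seq_along(words)) {
  if (words[w] %in% incorrects.unlist) {
    # %in% compares whole strings, so only exact matches count;
    # pos flags the row whose list of misspellings contains the word
    pos <- sapply(incorrects.list, function(x) words[w] %in% x)
    newwords[w] <- spellDB$Correct[pos]
  } else {
    # Not a known misspelling: keep the word unchanged
    newwords[w] <- words[w]
  }
}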
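If the loop turns out to be slow on a larger word list, the same exact-match lookup can be sketched without a loop by using `match()` against a long-format table. This is only a sketch under assumptions: the small `spellDB`, `words`, and the `counts` matrix standing in for TDM.frame are invented here for illustration, and each misspelling is assumed to appear in only one row.

```r
# Small illustrative data; rows copied from the spellDB excerpt in the question
spellDB <- data.frame(
  Correct   = c("ability", "account", "adventure"),
  Incorrect = c("abilities",
                "aacount accound accoun accountc acount accout accounts",
                "adventur adventures"),
  stringsAsFactors = FALSE
)
words <- c("absolute", "absolutely", "accounts", "accout", "adventur")

# One row per misspelling, with its correct word repeated alongside it
incorrects.list <- strsplit(spellDB$Incorrect, " ", fixed = TRUE)
lookup <- data.frame(
  bad  = unlist(incorrects.list),
  good = rep(spellDB$Correct, lengths(incorrects.list)),
  stringsAsFactors = FALSE
)

# match() compares whole strings; words with no match come back as NA
idx <- match(words, lookup$bad)
newwords <- ifelse(is.na(idx), words, lookup$good[idx])
newwords
# "absolute" "absolutely" "account" "account" "adventure"

# Hypothetical term counts standing in for TDM.frame (rows = terms)
counts <- matrix(c(1, 0, 2, 1, 3,
                   0, 1, 1, 0, 1),
                 ncol = 2, dimnames = list(words, c("doc1", "doc2")))

# rowsum() adds together the rows that now share a corrected term
merged <- rowsum(counts, group = newwords)
merged["account", ]   # doc1 = 3, doc2 = 1
```

The `rowsum()` step is what brings the word counts back in line: "accounts" and "accout" collapse into a single "account" row whose counts are the sum of the two.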

Solution 1
1 Accepted 2015-04-28 17:37:20
