I have a data frame of words, spelled both correctly and incorrectly, and a separate list of words gathered from the user. I need to check each user word and find its correctly spelled version in the data frame.
The code below runs, but it makes too many approximate matches. Because of the type of data I'm using, I need it to match words exactly. Does anyone know how I can do this?
TDM.frame is a term document matrix generated from the user input, which is a CSV of thousands of entries.
spellDB <- read.csv("spellcheck.csv")
words <- row.names(TDM.frame)
k <- 0
wordLoc <- NULL
badWord <- NULL
goodWord <-NULL
for (i in 1:nrow(TDM.frame)){
  if(length(grep(words[i],spellDB$Incorrect))>0){
    k <- k + 1
    wordLoc[k] <- grep(words[i],spellDB$Incorrect,fixed = TRUE)
    badWord[k] <- words[i]
    goodWord[k] <- as.character(spellDB$Correct[wordLoc[k]])
    corrections <- cbind(goodWord,badWord)
  }
}
This outputs the following:
> corrections
goodWord badWord
[1,] "account" "accounts"
[2,] "account" "accout"
[3,] "activate" "act"
[4,] "faction" "action"
[5,] "activate" "activate"
[6,] "activate" "activated"
Rows 1, 2, 5 and 6 are correct, since those words are in spellDB. Rows 3 and 4 are not in spellDB, so they should not be matched.
I have also tried this regex (and others), but it does not work at all: the only result I ever get is integer(0).
grep(paste0("?=.\\b",words[12],"\\b"),spellDB$Incorrect)
My goal is to correct the spelling in the data so that the term document matrix is correct and the word counts are accurate. If there's a better way to do this, then great. This feels like a messy way to handle it, but I'm new to R and have not been able to find an alternative.
Thanks for reading!
EDIT: The words list I'm referencing has 1143 entries; its head reads:
> head(words)
[1] "absolute" "absolutely" "acceptedcompleted" "accidently" "accounts" "accout"
The spellDB reads like this:
Correct Incorrect
1 ability abilities
2 account aacount accound accoun accountc acount accout accounts
3 adventure adventur adventures
4 amazon amazoncom amazonid amazons
5 android andoid
6 apple appleid
EDIT2:
Sadly, I can't post the full dput output, since some of it is sensitive data. However, I have cut out the offending words, since they are not relevant anyway.
> dput(head(spellDB))
structure(list(Correct = structure(c(1L, 2L, 5L, 8L, 9L, 10L), .Label = c("ability",
"account", "achievment", "activate", "adventure"), class = "factor"),
Incorrect = structure(c(4L, 3L, 119L, 120L, 121L, 122L), .Label = c("",
"", " aacount accound accoun accountc acount accout accounts ",
" abilities ", " acheiv acheivements acheivment achi achiement achiev achievcement achieve achieved achieveent achievements achievemetn achievments achievmnet achiv achive achived achivement achivements achivment achivmenti achivments achv achviement avhivemnt",
"andoid", "appleid", "cbind"), class = "factor")), .Names = c("Correct",
"Incorrect"), row.names = c(NA, 6L), class = "data.frame")
Your dput-ed data didn't work for me, so I recreated it:
spellDB <- read.table(text=" Correct, Incorrect
ability, abilities
account, aacount accound accoun accountc acount accout accounts
adventure, adventur adventures
amazon, amazoncom amazonid amazons
android , andoid
apple, appleid", sep=",", as.is=T, header=T)
spellDB[,1] <- gsub(" +", " ", spellDB[,1])
spellDB[,1] <- gsub("^\\s|\\s$", "", spellDB[,1])
spellDB[,2] <- gsub(" +", " ", spellDB[,2])
spellDB[,2] <- gsub("^\\s|\\s$", "", spellDB[,2])
This solution works, but I'm not sure how efficient it will be for your number of rows. It checks whether each word is present in one big black-list of misspellings; if it is, it finds the corresponding correct word and adds that to the new vector, otherwise it keeps the word unchanged.
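# Split each row's misspellings into a vector, then flatten them into one black-list
incorrects.list <- strsplit(spellDB$Incorrect, " ")
incorrects.unlist <- unlist(incorrects.list)
words <- c("absolute","absolutely","acceptedcompleted","accidently", "accounts","accout")
newwords <- rep(NA, length(words))
for (w in seq_along(words)) {
  if (words[w] %in% incorrects.unlist) {
    # Find the row whose misspellings contain this exact word
    pos <- sapply(incorrects.list, function(x) words[w] %in% x)
    # Take the first hit in case a misspelling is listed in more than one row
    newwords[w] <- spellDB$Correct[pos][1]
  } else {
    # Not a known misspelling: keep the word as it is
    newwords[w] <- words[w]
  }
}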
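If the loop turns out to be slow on the full data, the same exact-match lookup can probably be done without it. The following is only a sketch under the same assumptions as above (a cleaned spellDB with one space-separated string of misspellings per row); the name lookup is my own and not part of your code. It flattens the black-list into a named vector and uses match(), which compares whole strings only:

```r
# Rebuild the example table (same values as above, already cleaned).
spellDB <- data.frame(
  Correct   = c("ability", "account", "adventure", "amazon", "android", "apple"),
  Incorrect = c("abilities",
                "aacount accound accoun accountc acount accout accounts",
                "adventur adventures",
                "amazoncom amazonid amazons",
                "andoid",
                "appleid"),
  stringsAsFactors = FALSE
)

# One vector of misspellings per row.
incorrects.list <- strsplit(spellDB$Incorrect, " ", fixed = TRUE)

# Hypothetical helper vector: names are misspellings, values are the correct words.
lookup <- setNames(rep(spellDB$Correct, lengths(incorrects.list)),
                   unlist(incorrects.list))

words <- c("absolute", "absolutely", "acceptedcompleted", "accidently",
           "accounts", "accout")

# match() compares whole strings, so "act" can never hit "activate".
idx <- match(words, names(lookup))

# Keep the original word when it is not a known misspelling.
newwords <- ifelse(is.na(idx), words, unname(lookup[idx]))
newwords
```

Since the end goal is accurate counts, and assuming TDM.frame holds numeric counts with one row per word, something like rowsum(TDM.frame, newwords) should then add up the rows that now share the same corrected word. I have not tried that on your data.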