I am doing a lot of analysis with the TM
package. One of my biggest problems are related to stemming and stemming-like transformations.
Let's say I have several accounting related terms (I am aware of the spelling issues).
After stemming we have:
accounts -> account
account -> account
accounting -> account
acounting -> acount
acount -> acount
acounts -> acount
accounnt -> accounnt
: 3 Terms (account, acount, account) where I would have liked 1 (account) as all these relate to the same term. :3个条款(帐户,帐户,帐户)我希望1(帐户),因为所有这些都与同一个词有关。
To correct spelling is a possibility, but I have never attempted that in R. Is that even possible? 纠正拼写是一种可能性,但我从来没有在R中尝试过。这是否可能?
The other option is to make a reference list ie account = (accounts, account, accounting, acounting, acount, acounts, accounnt) and then replace all occurrences with the master term. 另一种选择是制作一个参考列表,即账户=(账户,账户,会计,会计,账户,账户,账户),然后将所有事件替换为主条款。 How would I do this in R?
Once again, any help/suggestions would be greatly appreciated.
We could set up a list of synonyms and replace those values. For example
synonyms <- list(
list(word="account", syns=c("acount", "accounnt"))
)
This says we want to replace "acount" and "accounnt" with "account" (i'm assuming we're doing this after stemming). Now let's create test data.
raw<-c("accounts", "account", "accounting", "acounting",
"acount", "acounts", "accounnt")
And now let's define a transformation function that will replace the words in our list with the primary synonym.
library(tm)
replaceSynonyms <- content_transformer(function(x, syn=NULL) {
Reduce(function(a,b) {
gsub(paste0("\\b(", paste(b$syns, collapse="|"),")\\b"), b$word, a)}, syn, x)
})
Here we use the content_transformer
function to define a custom transformation. And basically we just do a gsub
to replace each of the words. We can then use this on a corpus
tm <- Corpus(VectorSource(raw))
tm <- tm_map(tm, stemDocument)
tm <- tm_map(tm, replaceSynonyms, synonyms)
inspect(tm)
and we can see all these values are transformed into "account" as desired. To add other synonyms, just add additional lists to the main synonyms
list. Each sub-list should have the names "word" and "syns".
Mr. Flick has answered question #2. I am approaching via answering question #1.
Here is an approach the uses a binary search of a known word data base ( DICTIONARY
from qdapDictionaries
). A binary lookup is slow for sure but if we make some assumptions about the replacing (like a range of differences in number of character). So here's the basic idea:
Corpus
into a unique bag of words using qdap
's bag_o_words
qdapDictionaries
' DICTIONARY
data set) to find words not recognize using match
misses
from step # 2 will be what we lookup nchar
misses
through a loop ( sapply
) and do the following: misses
using tm::stemDocument
nchar
agrep
with a max.distance
to eliminate more words from the dictionary agrep
) to determine the word from dictionary that is closest to the missed element [note this is a non-exported function from qdap
called qdap:::Ldist
] gsub
bing tm_map
with a custom tm
flavored gsub
function to replace words tm_map
and stemDocument
Here's the code. I made a fake Corpus
using the words you provide and some random words to demonstrate how to do this from start to end. You can play with range
and max.distance
that is supplied to sapply
. The looser you are with these the slower the search will be but tightiening these too much will make it more likely to make a mistake. This really isn't an answer for spelling correction in a general sense but works here because you were stemming anyway. There's an Aspell
package but I have never used it before.
terms <- c("accounts", "account", "accounting", "acounting", "acount", "acounts", "accounnt")
library(tm); library(qdap)
fake_text <- unlist(lapply(terms, function(x) {
paste(sample(c(x, sample(DICTIONARY[[1]], sample(1:5, 1)))), collapse=" ")
}))
fake_text
myCorp <- Corpus(VectorSource(fake_text))
terms2 <- unique(bag_o_words(as.data.frame(myCorp)[[2]]))
misses <- terms2[is.na(match(terms2, DICTIONARY[[1]]))]
chars <- nchar(DICTIONARY[[1]])
replacements <- sapply(misses, function(x, range = 3, max.distance = .2) {
x <- stemDocument(x)
wchar <- nchar(x)
dict <- DICTIONARY[[1]][chars >= (wchar - range) & chars <= (wchar + range)]
dict <- dict[agrep(x, dict, max.distance=max.distance)]
names(which.min(sapply(dict, qdap:::Ldist, x)))
})
replacer <- content_transformer(function(x) {
mgsub(names(replacements), replacements, x, ignore.case = FALSE, fixed = FALSE)
})
myCorp <- tm_map(myCorp, replacer)
inspect(myCorp <- tm_map(myCorp, stemDocument))
This question inspired me to attempt to write a spell check for the qdap
package. There's an interactive version that may be useful here. It's available in qdap >= version 2.1.1
. That means you'll need the dev version at the moment.. here are the steps to install:
library(devtools)
install_github("qdapDictionaries", "trinker")
install_github("qdap", "trinker")
library(tm); library(qdap)
## Recreate a Corpus
like you describe.
terms <- c("accounts", "account", "accounting", "acounting", "acount", "acounts", "accounnt")
fake_text <- unlist(lapply(terms, function(x) {
paste(sample(c(x, sample(DICTIONARY[[1]], sample(1:5, 1)))), collapse=" ")
}))
fake_text
inspect(myCorp <- Corpus(VectorSource(fake_text)))
## The interactive spell checker ( check_spelling_interactive
)
m <- check_spelling_interactive(as.data.frame(myCorp)[[2]])
preprocessed(m)
inspect(myCorp <- tm_map(myCorp, correct(m)))
The correct
function merely grabs a closure function from the output of check_spelling_interactive
and allows you to then apply the "correcting" to any new text string(s).
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.