I'm using the tm package to run LDA on my corpus. I have a corpus containing 10,000 documents.
rtcorpus.4star <- Corpus(DataframeSource(rt.subset.4star)) ##creates the corpus
rtcorpus.4star[[1]] ##accesses the first document
I'm trying to write a piece of code that will add the word "specialword" after certain words. Essentially: for a vector of words that I choose (good, nice, happy, fun, love), I want the code to loop through each document and add the word "specialword" after any of these words.
So for example, given this document:
I had a really fun time
I want the result to be this:
I had a really fun specialword time
The issue is that I'm not sure how to do this because I don't know how to get the code to read within the corpus. I know I should do a for loop (or maybe not), but I'm not sure how to loop through each word in each document, and each document in the corpus. I'm also wondering if I can use something along the lines of a "translate" function that works in tm_map.
Edit::
Made some attempts. This code returns "test" as NA. Do you know why?
special <- c("poor", "lose")
for (i in special){
test <- gsub(special[i], paste(special[i], "specialword"), rtcorpus.1star[[1]])
}
Edit: figured it out!! thanks
special <- c("poor", "lose")
for (i in 1:length(special)){
rtcorpus.codewordtest <-gsub(special[i], paste(special[i], "specialword"), rtcorpus.codewordtest)
}
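For anyone wondering why the first attempt gave NA: as far as I can tell, `for (i in special)` hands you the words themselves ("poor", "lose"), not positions, so `special[i]` becomes a lookup by name on a vector that has no names, and that is NA. Here is a small base R sketch of the difference. The `doc` string is just a made-up stand-in for one document, not the real corpus.

```r
special <- c("poor", "lose")
doc <- "a poor way to lose"   # hypothetical stand-in for one document

# Looping over the values: i is "poor", then "lose".
# special["poor"] is a lookup by name, and the vector has no names, so it is NA.
for (i in special) {
  print(is.na(special[i]))
}

# Looping over positions: special[i] is the word itself.
result <- doc
for (i in seq_along(special)) {
  result <- gsub(special[i], paste(special[i], "specialword"), result)
}
print(result)
```

`seq_along(special)` does the same job as `1:length(special)` but also behaves sensibly when the vector is empty.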
What if you tried something like this?
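corpus <- readLines("filename.txt")  # read the file as a character vector, one element per line
special <- c("fun","nice","love")
for (w in special) {
  # gsub() returns a new vector, so assign the result back
  corpus <- gsub(w, paste(w, "specialword"), corpus)
}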
This would load the file, iterate through your list of words, and replace the word with the word itself followed by " specialword" (note the space).
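One thing to watch with a plain `gsub(w, ...)`: it also matches inside longer words, so "fun" would hit "funny". A possible way around that, sketched below, is to build a single pattern with word boundaries and use a backreference. `add_special` is a name I made up for illustration, and it assumes the words contain no regex metacharacters.

```r
# Hypothetical helper: insert `marker` after every whole-word match of `words`.
add_special <- function(text, words, marker = "specialword") {
  # \\b marks a word boundary; the group captures whichever word matched
  pattern <- paste0("\\b(", paste(words, collapse = "|"), ")\\b")
  # \\1 puts the matched word back, followed by a space and the marker
  gsub(pattern, paste0("\\1 ", marker), text, perl = TRUE)
}

special <- c("good", "nice", "happy", "fun", "love")
out <- add_special("I had a really fun time", special)
print(out)

# Words that merely contain a target are left alone
untouched <- add_special("a funny, lovely day", special)
print(untouched)
```

Because this is an ordinary function on character vectors, it could probably be passed to `tm_map()` through `content_transformer()` if you want to apply it to an existing corpus, though I have not tested that against your data.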
Edit: I just saw you have multiple files. To loop through the files in the corpus, you can do this:
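corpus_dir <- "filepath/desktop/wherever/folderwithcorpus/"
special <- c("fun","nice","love")
# list.files() returns the path of every file in the folder
for (file in list.files(corpus_dir, full.names = TRUE)) {
  data <- readLines(file)
  for (w in special) {
    data <- gsub(w, paste(w, "specialword"), data)
  }
  # data now holds the modified text for this file
}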
Perhaps this is not a tm package capability, but you could use a simple paste() call on your vector of words to add "specialword" immediately after each one. Or str_replace_all() in the stringr package would do this if your documents can be in a list (I think); note that str_replace() only replaces the first match in each string.
Then create the corpus.
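A rough sketch of that order of operations in base R, assuming the text sits in a data frame column. The `reviews` data frame and its `text` column are invented here for illustration, since the structure of `rt.subset.4star` isn't shown.

```r
# Hypothetical data frame standing in for rt.subset.4star
reviews <- data.frame(
  text = c("I had a really fun time", "nice place, love it"),
  stringsAsFactors = FALSE
)
special <- c("good", "nice", "happy", "fun", "love")

# gsub() is vectorized, so each pass updates the whole column at once
for (w in special) {
  reviews$text <- gsub(paste0("\\b", w, "\\b"), paste(w, "specialword"), reviews$text)
}
print(reviews$text)

# The corpus would then be built from the modified data frame,
# the same way as in the question.
```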