How to add words to documents in a corpus?

I'm using the tm package to run LDA on my corpus. I have a corpus containing 10,000 documents.

rtcorpus.4star <- Corpus(DataframeSource(rt.subset.4star)) ##creates the corpus
rtcorpus.4star[[1]] ##accesses the first document

I'm trying to write a piece of code that will add the word "specialword" after certain words. Essentially: for a vector of words that I choose (good, nice, happy, fun, love), I want the code to loop through each document and add the word "specialword" after any of these words.

So for example, given this document:

I had a really fun time

I want the result to be this:

I had a really fun specialword time

The issue is that I'm not sure how to do this because I don't know how to get the code to read within the corpus. I know I should do a for loop (or maybe not), but I'm not sure how to loop through each word in each document, and each document in the corpus. I'm also wondering if I can use something along the lines of a "translate" function that works in tm_map.


Edit:

Made some attempts. This code returns "test" as NA. Do you know why?

special <- c("poor", "lose")
for (i in special){
test <- gsub(special[i], paste(special[i], "specialword"), rtcorpus.1star[[1]])
}

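Edit: figured it out, thanks! In the first attempt, i took the words themselves as values, so special[i] looked them up as names and returned NA. Looping over the indices fixes it: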

special <- c("poor", "lose")
for (i in seq_along(special)) {
    rtcorpus.codewordtest <- gsub(special[i], paste(special[i], "specialword"), rtcorpus.codewordtest)
}

What if you tried something like this?

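corpus <- readLines("filename.txt")  # character vector, one element per line
special <- c("fun", "nice", "love")
for (w in special) {
    # paste() joins strings in R ("+" does not); gsub() returns a new vector, so assign it back
    corpus <- gsub(w, paste(w, "specialword"), corpus)
}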

This would load the file, iterate through your list of words, and replace the word with the word itself followed by " specialword" (note the space).
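One thing to watch with a plain gsub() loop: it matches substrings, so "fun" would also match inside "funny". If only whole words should be tagged (that is an assumption about what you want), a rough sketch using word boundaries could look like this. The two sample documents are made up for illustration.

```r
special <- c("fun", "nice", "love")

# Made-up sample documents standing in for the real text
docs <- c("I had a really fun time",
          "That was a funny, lovely film")

for (w in special) {
    # \\b marks a word boundary, so "fun" does not match inside "funny"
    pattern <- paste0("\\b", w, "\\b")
    docs <- gsub(pattern, paste(w, "specialword"), docs, perl = TRUE)
}

print(docs)
# [1] "I had a really fun specialword time" "That was a funny, lovely film"
```

The second document is left alone because neither "funny" nor "lovely" is a whole-word match.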

Edit: I just saw you have multiple files. To loop through the files in the corpus, you can do this:

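corpus_dir <- "filepath/desktop/wherever/folderwithcorpus/"
special <- c("fun", "nice", "love")
docs <- list()  # modified text, keyed by file path

# list.files() enumerates the folder; looping over the path string itself would run only once
for (file in list.files(corpus_dir, full.names = TRUE)) {
    data <- readLines(file)
    for (w in special) {
        data <- gsub(w, paste(w, "specialword"), data)
    }
    docs[[file]] <- data  # keep the result instead of discarding it
}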

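This may not be something the tm package does itself, but you could use a simple paste() on your vector of chosen words to append "specialword" immediately after each one. Alternatively, str_replace() in the stringr package should do this if your documents can be held in a list (I think); note that str_replace() only replaces the first match in each string, so str_replace_all() is the one to use if a word can occur more than once.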

Then create the corpus.
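A minimal sketch of both routes, assuming the tm package is installed and the documents are available as a character vector. The sample text and the names pattern, add_marker, corpus_a and corpus_b are made up for illustration. Route 1 edits the text and then creates the corpus; route 2 builds the corpus first and applies the same substitution through tm_map() with content_transformer(), which is roughly the "translate"-style approach the question asks about.

```r
library(tm)

special <- c("fun", "nice", "love")
# One regex for all words: \\b(fun|nice|love)\\b, whole words only
pattern <- paste0("\\b(", paste(special, collapse = "|"), ")\\b")

# Made-up sample documents standing in for the real text column
docs <- c("I had a really fun time",
          "A nice day, I love it")

# Route 1: modify the raw text, then create the corpus
marked <- gsub(pattern, "\\1 specialword", docs, perl = TRUE)
corpus_a <- VCorpus(VectorSource(marked))

# Route 2: create the corpus first, then transform each document.
# content_transformer() wraps a character function so tm_map() can apply it.
add_marker <- content_transformer(function(x, pattern) {
    gsub(pattern, "\\1 specialword", x, perl = TRUE)
})
corpus_b <- tm_map(VCorpus(VectorSource(docs)), add_marker, pattern = pattern)

content(corpus_a[[1]])
content(corpus_b[[2]])
```

Both routes should give the same text; the "\\1" in the replacement puts back whichever word matched, so no explicit loop over the words is needed.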
