
How to take the first 25 words of each document in a corpus (in R)?

I'm guessing that the technique for this is similar to taking the first N characters from any data frame, regardless of whether it is a corpus or not.

My attempt:

create.greetings <- function(corpus, create_df = FALSE) {
  for(i in length(Charlotte.corpus.raw)) {
    Doc1<-Charlotte.corpus.raw[i]
    Word1<-Doc1[1:25]
    Greetings[i]<-Word1
  }
  return(VCorpus)
}

Here Greetings starts out as a corpus with n=6, because I couldn't figure out how to make an empty corpus or one large enough to hold the results. The corpus I'm working with ( Charlotte.corpus.raw ) has 200 documents. Unlike vectors (and, by extension, data frames), there doesn't seem to be an easy way to create empty corpora.

Part of the problem is that R doesn't seem to recognize a "document" class; it only seems to recognize corpus. That is, to R, a single document is a corpus of n=1.
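For what it's worth, the behaviour described above looks like what single-bracket indexing does in tm: corpus[i] returns a corpus of length 1, while corpus[[i]] returns the document itself. A minimal sketch, assuming the 'tm' package is installed; the toy corpus and the names toy, sub_corpus and one_doc are invented for illustration:

```r
library(tm)  # attaching tm also attaches NLP, which provides content()

# Two toy documents, made up for this illustration
toy <- VCorpus(VectorSource(c("hello there general kenobi",
                              "good morning to you all")))

sub_corpus <- toy[1]    # single bracket: still a corpus, of length 1
one_doc    <- toy[[1]]  # double bracket: the document object itself

class(sub_corpus)       # includes "VCorpus"
class(one_doc)          # includes "PlainTextDocument"
content(one_doc)        # the raw text of that one document
```

So a loop that wants the text of document i would probably index with [[i]] and then call content() on the result, rather than slicing the length-1 corpus.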

Reproducible sample: you will need the 'tm', 'dplyr', and 'NLP' packages, as well as more common R packages.

library(tm)     # DirSource, VCorpus, tm_map, content_transformer
library(dplyr)  # provides the %>% pipe

read.corpus <- function(directory, pattern = "", to.lower = TRUE) {
  # Read files and create a `VCorpus` object
  corpus <- DirSource(directory = directory, pattern = pattern) %>%
    VCorpus()
  # Lowercase text
  if (to.lower) corpus <- tm_map(corpus, content_transformer(tolower))
  return(corpus)
}

Then run the function on any directory containing a few txt documents to get a corpus to work with, and replace Charlotte.corpus.raw above with the name of your corpus.

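Each row of greetings will contain the first 25 words of each document (padded with NA if a document has fewer than 25 words). Note that unlist(corpus[i]) returns lines and metadata rather than words, so the text is split on whitespace first:

library(tm)

greetings <- c()
for (i in seq_along(corpus)) {
  # Join the document's lines into one string, then split it into words
  text <- paste(content(corpus[[i]]), collapse = " ")
  words <- strsplit(trimws(text), "\\s+")[[1]]
  greetings <- rbind(greetings, words[1:25])
}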
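If the goal is a string of the first words rather than a matrix row, the word-splitting step can be factored into a small helper. This is only a sketch in base R; first_n_words and sample_text are hypothetical names introduced here, and "word" is assumed to mean a whitespace-separated token:

```r
# Hypothetical helper: keep the first n whitespace-separated words of a text.
# `text` may be a character vector of lines, such as content() returns
# for a tm plain-text document.
first_n_words <- function(text, n = 25) {
  joined <- trimws(paste(text, collapse = " "))
  if (!nzchar(joined)) return("")             # empty document: nothing to keep
  words <- strsplit(joined, "\\s+")[[1]]      # split on runs of whitespace
  paste(head(words, n), collapse = " ")       # fewer than n words is fine
}

# Made-up two-line document for illustration
sample_text <- c("dear sir or madam", "i hope this letter finds you well")
greeting <- first_n_words(sample_text, n = 5)
```

Because the helper takes and returns plain text, it should be usable with the same pattern the question already uses for lowercasing, e.g. tm_map(corpus, content_transformer(first_n_words)), which would return a corpus of greetings instead of requiring an empty corpus to fill in.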
