
number of words in a corpus

I am looking for a way to find the most frequent words in a text using R. By most frequent, I mean words whose frequency is at least 1% of the total number of words in the corpus. So I need to calculate the number of words in the corpus.

Here is my code so far:

#!/usr/bin/Rscript
library('tm')
library('wordcloud')
library('RColorBrewer')

# Read every text file in ~/txt into a corpus
twittercorpus <- Corpus(DirSource("~/txt"),
                        readerControl = list(language = "en"))

# Basic cleaning: drop numbers, lowercase, strip punctuation, remove stopwords
twittercorpus <- tm_map(twittercorpus, removeNumbers)
twittercorpus <- tm_map(twittercorpus, content_transformer(tolower))
twittercorpus <- tm_map(twittercorpus, removePunctuation)
my_stopwords <- stopwords("SMART")
twittercorpus <- tm_map(twittercorpus, removeWords, my_stopwords)

mydata.dtm <- TermDocumentMatrix(twittercorpus)

I need something like:

freqmatrix <- findFreqTerms(mydata.dtm, lowfreq = rowSums(mydata.dtm)/100)

If you look at str(mydata.dtm), there is a named component called nrow. Use that:

freqmatrix <- findFreqTerms(mydata.dtm, lowfreq = mydata.dtm$nrow/100)
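
Note that mydata.dtm$nrow is the number of distinct terms (rows) in the matrix, not the total number of word tokens. If, as the question describes, the 1% threshold should be based on the total word count, one option (a minimal sketch, assuming the mydata.dtm built above) is to sum the entries of the term-document matrix:

# A TermDocumentMatrix is stored as a simple triplet matrix, so summing
# its $v slot gives the total number of word tokens left after cleaning
total_words <- sum(mydata.dtm$v)

# terms whose overall frequency is at least 1% of that total
freqmatrix <- findFreqTerms(mydata.dtm, lowfreq = total_words / 100)

findFreqTerms() compares each term's frequency summed over all documents against lowfreq, so a single scalar threshold is all it needs.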
