
Number of words in a corpus

I am looking for a way to find the most frequent words in a text, and I am using R. By most frequent, I mean words whose frequency is at least 1% of the total number of words in the corpus. So I need to calculate the number of words in the corpus.

Here is my code so far:

#!/usr/bin/Rscript
library('tm')
library('wordcloud')
library('RColorBrewer')
# read the text files in ~/txt into a corpus
twittercorpus <- Corpus(DirSource("~/txt"),
                        readerControl = list(language = "en"))
# clean up: drop numbers, lowercase, drop punctuation
twittercorpus <- tm_map(twittercorpus, removeNumbers)
twittercorpus <- tm_map(twittercorpus, content_transformer(tolower))
twittercorpus <- tm_map(twittercorpus, removePunctuation)
# remove SMART stopwords
my_stopwords <- stopwords("SMART")
twittercorpus <- tm_map(twittercorpus, removeWords, my_stopwords)
# term-document matrix: terms as rows, documents as columns
mydata.dtm <- TermDocumentMatrix(twittercorpus)

I need something like:

freqmatrix <- findFreqTerms(mydata.dtm, lowfreq = rowSums(mydata.dtm)/100)

If you look at str(mydata.dtm), there is a named component called nrow. Use that:

freqmatrix <- findFreqTerms(mydata.dtm, lowfreq = mydata.dtm$nrow/100)
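
One caveat (an aside, not part of the original answer): nrow is the number of distinct terms in the matrix, not the total number of word tokens in the corpus. If the 1% threshold is meant to be relative to the total token count, the nonzero cell counts of the sparse matrix can be summed instead. A minimal sketch, assuming mydata.dtm is the simple_triplet_matrix that tm builds, whose $v component holds the nonzero counts:

# total number of word tokens (after preprocessing): sum of all cell counts
total_tokens <- sum(mydata.dtm$v)
# terms whose frequency is at least 1% of the total token count
freqmatrix <- findFreqTerms(mydata.dtm, lowfreq = total_tokens / 100)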
