Finding the frequency of specific words in individual documents of a corpus - R, TermDocumentMatrix, TM

Question

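For a research project I am working on, I have read PDF documents into R and built a corpus and a TermDocumentMatrix. I want to check how often specific words occur in each document of my corpus. The code below gives me the kind of matrix I want, with per-document word frequencies, but it only covers high-frequency terms rather than specific terms that I choose.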

ft <- findFreqTerms(opinions.tdm, lowfreq = 100, highfreq = Inf)
as.matrix(opinions.tdm[ft, ])

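I found the code below in another post. It lets me look up the frequency of a specific term, but it sums the counts across all documents. How can I adapt it so that I get the count of a specific term in each document rather than across the whole corpus?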

library(tm)
data("crude")
crude <- as.VCorpus(crude)
crude <- tm_map(crude, stripWhitespace)
crude <- tm_map(crude, removePunctuation)
crude <- tm_map(crude, content_transformer(tolower))
crude <- tm_map(crude, removeWords, stopwords("english"))


tdm <- TermDocumentMatrix(crude)

# Convert the tdm to a dense matrix and sum each term's counts across all documents.
freq <- rowSums(as.matrix(tdm))
freq["crude"]
# crude 
#    21 
freq["oil"]
# oil 
#  85

Answer 1

Skip the rowSums step and index the dense matrix by term name. Each row holds that term's count in every document.

term_matrix <- as.matrix(tdm)
term_matrix["crude",]
# 127 144 191 194 211 236 237 242 246 248 273 349 352 353 368 489 
#   2   0   2   3   0   2   0   0   0   0   5   2   0   2   0   0 
# 502 543 704 708 
#   0   2   0   1 
term_matrix["oil",]
# 127 144 191 194 211 236 237 242 246 248 273 349 352 353 368 489 
#   5  12   2   1   1   7   3   3   5   9   5   4   5   4   3   4 
# 502 543 704 708 
#   5   3   3   1
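
For a large corpus, converting the whole TermDocumentMatrix to a dense matrix can use a lot of memory. The sketch below is one possible variation on the answer above, not part of it: it subsets the sparse matrix to the terms of interest first and only then calls `as.matrix`. The vector name `wanted` and the placeholder term `"notaword"` are hypothetical and used for illustration only. The second option relies on the `dictionary` entry of the `control` list, which restricts the matrix to a fixed vocabulary when it is built.

```r
library(tm)
data("crude")

# Same preprocessing steps as in the question.
corpus <- tm_map(crude, stripWhitespace)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeWords, stopwords("english"))
tdm <- TermDocumentMatrix(corpus)

# Hypothetical terms of interest; "notaword" is deliberately not in the corpus.
wanted <- c("crude", "oil", "notaword")

# Option 1: keep only the terms that actually occur, subset the sparse
# matrix, and densify just those rows.
present <- intersect(wanted, Terms(tdm))
per_doc <- as.matrix(tdm[present, ])
per_doc  # rows are terms, columns are document IDs

# Option 2: restrict the vocabulary while building the matrix.
tdm_dict <- TermDocumentMatrix(corpus, control = list(dictionary = wanted))
as.matrix(tdm_dict)
```

Filtering with `intersect` avoids indexing by a term that is not in the matrix. Summing a row of `per_doc` with `rowSums` should give back the corpus-wide totals shown in the question.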

Solution 1
0 Accepted 2020-07-08 05:00:35