
R - get token count of every document in DocumentTermMatrix

The reason I want to do this is so I can convert absolute frequencies into relative frequencies. It is easy to get the token count for each term in every document, but I'm not sure how to get the total token count for every document and use it in the same step, so that I can divide every count by its document's total. Is there a way to bind the row sums to the matrix and then use that column in the calculation, or is there a better way to do this?

Thanks

Using blog data from the English version of the heliohost corpus as my text data, it's pretty easy to get token counts by document via the quanteda package.

library(readr)
library(quanteda)

# Read the blog file, one document per line
blogFile <- "./capstone/data/en_US.blogs.txt"
blogData <- read_lines(blogFile)

# Build a quanteda corpus and time the construction
system.time(theText <- corpus(blogData))

# Per-document type, token, and sentence counts
head(summary(theText))

...and the output is:

> head(summary(theText))
Corpus consisting of 899288 documents, showing 100 documents:

  Text Types Tokens Sentences
 text1    18     20         1
 text2     6      7         1
 text3   104    154         7
 text4    36     43         1
 text5    91    119         5
 text6    13     13         1

Source:  C:/Users/leona/gitrepos/datascience/* on x86-64 by leona
Created: Sat Dec 02 20:59:23 2017
Notes:    
>
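If you stay within quanteda, `summary()` is not even needed: `ntoken()` returns per-document token totals, and `dfm_weight()` with `scheme = "prop"` converts a document-feature matrix straight into relative frequencies. A minimal sketch with made-up two-document texts (assuming a recent quanteda where `dfm()` takes a tokens object):

```r
library(quanteda)

# Hypothetical two-document corpus
txt <- c(text1 = "apple pear apple plum",
         text2 = "plum plum apple apple apple plum")
toks <- tokens(txt)
m <- dfm(toks)

# Total token count for every document (named integer vector)
ntoken(m)

# Relative frequencies: each count divided by its document's token total
rel <- dfm_weight(m, scheme = "prop")
```

After weighting, every row of `rel` sums to 1, which is exactly the absolute-to-relative conversion asked about above.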

Thank you. In fact, I think I found a method: divide by rowSums(dtm). I hope this is the correct approach.
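For the record, that row-sum division can be sketched in base R on a toy matrix (the data below is hypothetical). Dividing a matrix by a vector of length `nrow` recycles column-wise, so each row is divided by its own total; `sweep()` expresses the same thing explicitly:

```r
# Toy document-term matrix of absolute counts (hypothetical data)
dtm <- matrix(c(2, 1, 1,
                3, 0, 3),
              nrow = 2, byrow = TRUE,
              dimnames = list(c("text1", "text2"),
                              c("apple", "pear", "plum")))

# Total token count of every document
totals <- rowSums(dtm)

# Relative frequencies: recycling divides every row by its own total;
# sweep() over margin 1 is the explicit equivalent
rel  <- dtm / totals
rel2 <- sweep(dtm, 1, totals, "/")

rowSums(rel)   # every document now sums to 1
```

Note that a tm `DocumentTermMatrix` is stored as a sparse `simple_triplet_matrix`, so this element-wise division may require coercing with `as.matrix()` first.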

