Document-term matrix to a list of matrices in R
I have a document-term matrix dtm, for example:
dtm
<<DocumentTermMatrix (documents: 50, terms: 50)>>
Non-/sparse entries: 220/2497
Sparsity : 100%
Maximal term length: 7
Weighting : term frequency (tf)
Now I want to convert it into a list of matrices, one per document, to match the input format that the stm package requires:
[[1]]
[,1] [,2] [,3] [,4]
[1,] 23 33 42 117
[2,] 2 1 3 1
[[2]]
[,1] [,2] [,3] [,4]
[1,] 2 19 93 168
[2,] 2 2 1 1
I am thinking of finding all the non-zero entries in dtm and building the matrices from them one row at a time, like so:
mat = matrix()
dtm.to.mat = function(x){
mat[1,] = x[x != 0]
mat[2,] = colnames(x[x != 0])
return(mat)
}
matrix = list(apply(dtm, 1, dtm.to.mat))
However, x[x != 0] just won't work. The error says:
$ operator is invalid for atomic vectors
I was wondering why this is the case. If I convert x to a matrix beforehand, I don't get this error. However, my actual dtm has approximately 2,500,000 rows, so I fear that would be very inefficient.
Me again! I wouldn't use a dtm as the input for the stm package unless your data is particularly strange. Use the function stm::textProcessor. You can pass the documents as raw (unprocessed) text in a character vector of any length. You can also specify the metadata as you wish:
Suppose you have a dataframe df with a column called df$documents, which is your raw text, and df$meta, which is your covariate:
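library(stm)
# metadata is passed as a one-column data frame so the covariate keeps its name
processed <- textProcessor(df$documents, metadata = df["meta"], lowercase = TRUE,
    removestopwords = TRUE, removenumbers = TRUE, removepunctuation = TRUE,
    stem = TRUE, wordLengths = c(3, Inf))
# data = processed$meta supplies the covariate rows that survived processing
stm_50 <- stm(documents = processed$documents, vocab = processed$vocab,
    K = 50, prevalence = ~ meta, data = processed$meta,
    init.type = "Spectral", seed = 57468)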
This will run a 50-topic STM. textProcessor will deal with empty documents and their associated metadata.
Edit: stm::textProcessor is technically just a wrapper for the tm package, but it is designed to remove problem documents while also dealing with their associated covariates.
The metadata argument can also be a dataframe with several columns if you have multiple covariates. In that case you would also need to modify the prevalence argument in the second call, the one to stm().
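As a rough illustration of the multiple-covariate case, here is a minimal sketch using the gadarian example data that ships with stm, rather than the df from the question. The choice of K = 3 and the two covariates (treatment and pid_rep) are assumptions made for the example, and prepDocuments is an extra step not mentioned above that keeps documents and metadata aligned after rare terms are dropped.

```r
library(stm)

# gadarian ships with stm; its columns include treatment, pid_rep and
# open.ended.response (the raw text).
processed <- textProcessor(documents = gadarian$open.ended.response,
                           metadata = gadarian)

# prepDocuments removes infrequent terms and drops the metadata rows of any
# documents that end up empty.
out <- prepDocuments(processed$documents, processed$vocab, processed$meta)

# Two covariates in the prevalence formula; s() is stm's spline wrapper.
fit <- stm(documents = out$documents, vocab = out$vocab, K = 3,
           prevalence = ~ treatment + s(pid_rep), data = out$meta,
           init.type = "Spectral", seed = 57468, verbose = FALSE)
```

The point to notice is that every variable named in the prevalence formula has to be a column of the data frame passed as data.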
If you have something tricky like this I'd switch over to the quanteda package, as it has nice converters to stm. If you want to stick with tm, have you tried using stm::convertCorpus to change the object into the list structure stm needs?
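For reference, the list structure itself can also be built from the sparse entries without ever creating a dense matrix. The sketch below is base R only: triplet_to_stm_docs is a hypothetical helper, and the toy vectors i, j and v stand in for the row index, column index and count fields that a slam simple_triplet_matrix (the class a tm DocumentTermMatrix is built on) stores. I have not timed it on a matrix with millions of rows.

```r
# Hypothetical helper: turn sparse triplet fields into stm-style documents.
# i = document index, j = term index, v = count of each non-zero entry.
triplet_to_stm_docs <- function(i, j, v, n_docs) {
  # Group entry positions by document; fixing the factor levels keeps
  # documents with no entries in the result as empty groups.
  groups <- split(seq_along(i), factor(i, levels = seq_len(n_docs)))
  lapply(groups, function(idx) {
    ord <- idx[order(j[idx])]                  # sort entries by term index
    rbind(as.integer(j[ord]), as.integer(v[ord]))
  })
}

# Toy stand-in for dtm$i, dtm$j, dtm$v of a 3-document, 4-term matrix.
i <- c(1, 1, 2, 2, 2, 3)
j <- c(1, 3, 2, 3, 4, 1)
v <- c(2, 1, 1, 3, 1, 5)

docs <- triplet_to_stm_docs(i, j, v, n_docs = 3)
docs[[1]]   # row 1: term indices, row 2: counts
```

With a real dtm the call would presumably be triplet_to_stm_docs(dtm$i, dtm$j, dtm$v, nrow(dtm)), with colnames(dtm) as the vocab. Documents with no entries come back as 2 x 0 matrices and would need to be dropped, together with their metadata rows, before fitting. Also worth checking: as far as I can tell from the stm documentation, stm::readCorpus with type = "slam" is the function that reads this kind of matrix into stm's format, while convertCorpus converts from stm's format to others.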