I have a document-term matrix dtm, for example:
dtm
<<DocumentTermMatrix (documents: 50, terms: 50)>>
Non-/sparse entries: 220/2497
Sparsity : 100%
Maximal term length: 7
Weighting : term frequency (tf)
Now I want to convert it into a list of matrices, one per document. This is the input format the stm package requires:
[[1]]
[,1] [,2] [,3] [,4]
[1,] 23 33 42 117
[2,] 2 1 3 1
[[2]]
[,1] [,2] [,3] [,4]
[1,] 2 19 93 168
[2,] 2 2 1 1
My idea is to find all the non-zero entries in dtm and turn them into matrices, one row at a time:
mat = matrix()
dtm.to.mat = function(x){
mat[1,] = x[x != 0]
mat[2,] = colnames(x[x != 0])
return(mat)
}
matrix = list(apply(dtm, 1, dtm.to.mat))
However,
x[x != 0]
just won't work. The error says:
$ operator is invalid for atomic vectors
I was wondering why this is the case. If I convert x to a matrix beforehand, I don't get this error. However, my actual dtm has approximately 2,500,000 rows, so I fear this would be very inefficient.
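For reference, a DocumentTermMatrix from tm is stored as a sparse triplet matrix (a list with components i, j and v for row index, column index and value), so the per-document matrices can be built from those three vectors without a dense conversion. The sketch below uses base R only; the function name triplet_to_stm_docs and the toy data are made up for illustration. Note the row order: term indices first, counts second, as in the example output above.

```r
# Build stm-style documents from sparse triplet entries (base R only).
# i = document (row) index, j = term (column) index, v = count.
# The function name is hypothetical, not part of tm or stm.
triplet_to_stm_docs <- function(i, j, v, n_docs) {
  ord <- order(i, j)                       # group by document, then by term
  i <- i[ord]; j <- j[ord]; v <- v[ord]
  # One group of positions per document; fixed levels keep empty documents.
  groups <- split(seq_along(i), factor(i, levels = seq_len(n_docs)))
  docs <- lapply(groups, function(idx) {
    # Row 1: term indices, row 2: counts (2 x 0 for an empty document).
    matrix(as.integer(c(j[idx], v[idx])), nrow = 2, byrow = TRUE)
  })
  unname(docs)
}

# Toy data: 3 documents, the third one empty.
i <- c(1, 1, 2, 2, 2)
j <- c(3, 1, 2, 4, 1)
v <- c(2, 1, 5, 1, 3)
docs <- triplet_to_stm_docs(i, j, v, n_docs = 3)
```

With a real dtm the call would presumably be triplet_to_stm_docs(dtm$i, dtm$j, dtm$v, nrow(dtm)), with colnames(dtm) as the vocabulary. Empty documents would likely still need to be dropped, together with their metadata, before fitting a model.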
Me again!
I wouldn't use a dtm as the input for the stm package unless your data is particularly strange. Use the function stm::textProcessor. You can pass the documents as raw (unprocessed) text in a character vector of any length. You can also specify the metadata as you wish:
Suppose you have a data frame df with a column df$documents, which is your raw text, and df$meta, which is your covariate:
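library(stm)
# textProcessor expects metadata as a data frame, so keep df["meta"] as one column
processed <- textProcessor(df$documents, metadata = df["meta"], lowercase = TRUE,
    removestopwords = TRUE, removenumbers = TRUE, removepunctuation = TRUE,
    stem = TRUE, wordLengths = c(3, Inf))
# data = processed$meta lets the prevalence formula find the covariate
stm_50 <- stm(documents = processed$documents, vocab = processed$vocab,
    K = 50, prevalence = ~ meta, data = processed$meta,
    init.type = "Spectral", seed = 57468)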
This will run a 50-topic STM. textProcessor will deal with empty documents and their associated metadata.
Edit: stm::textProcessor is technically just a wrapper for the tm package. But it is designed to remove problem documents, while dealing with their associated covariates.
Also, the metadata argument can take a data frame if you have multiple covariates. In that case you would also need to modify the prevalence argument in the stm() call.
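As a rough sketch of the multiple-covariate case: the column names party and year below are hypothetical, as is the wrapper function itself, and the snippet assumes df also has a documents column holding the raw text. It is one way this could look, not a definitive recipe.

```r
# Sketch: several covariates passed to textProcessor as a data frame.
# `party` and `year` are hypothetical column names, not from the original post.
fit_stm_with_covariates <- function(df, K = 50) {
  processed <- stm::textProcessor(df$documents,
                                  metadata = df[, c("party", "year")])
  stm::stm(documents = processed$documents,
           vocab = processed$vocab,
           K = K,
           prevalence = ~ party + year,  # one term per covariate
           data = processed$meta,        # metadata for the documents that were kept
           init.type = "Spectral")
}
```

Wrapping the calls in a function keeps the example free of any particular dataset; calling fit_stm_with_covariates(df) requires the stm package to be installed.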
If you have something tricky like this I'd switch over to the quanteda package as it has nice converters to stm. If you want to stick with tm, have you tried using stm::convertCorpus to change the object into the list structure stm needs?
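A minimal sketch of the conversion, with one caveat: as far as I can tell from the stm documentation, readCorpus is the function that reads a sparse matrix into stm's list format, while convertCorpus goes from stm's format to other formats, so the sketch uses readCorpus. Treat that as an assumption to check against your installed version. The wrapper name dtm_to_stm_input is made up for illustration.

```r
# Sketch: turn a tm DocumentTermMatrix into the list structure stm expects.
# Assumes the stm package is installed; the wrapper name is hypothetical.
dtm_to_stm_input <- function(dtm) {
  # type = "slam" is meant for simple triplet matrices, which is how
  # tm stores a DocumentTermMatrix.
  out <- stm::readCorpus(dtm, type = "slam")
  out  # a list with $documents and $vocab
}

# Possible quanteda route (assumption, not verified here):
#   quanteda::convert(quanteda::as.dfm(dtm), to = "stm")
```

The result could then be passed on as stm(documents = out$documents, vocab = out$vocab, K = ...).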