Document-term matrix to a list of matrices R

Question

I have a document-term matrix dtm, for example:

    dtm
    <<DocumentTermMatrix (documents: 50, terms: 50)>>
    Non-/sparse entries: 220/2497
    Sparsity           : 100%
    Maximal term length: 7
    Weighting          : term frequency (tf)

Now I want to convert it into a list of matrices, each representing one document. This is the input format required by the stm package:

    [[1]]
         [,1] [,2] [,3] [,4]
    [1,]  23   33   42   117
    [2,]   2    1    3     1

    [[2]]
         [,1] [,2] [,3] [,4]
    [1,]   2   19   93   168
    [2,]   2    2    1     1

I am thinking of finding all the non-zero entries in dtm and assembling them into matrices, one row at a time, so:

    mat = matrix()
    dtm.to.mat = function(x){
        mat[1,] = x[x != 0]
        mat[2,] = colnames(x[x != 0])
        return(mat)
    }
    matrix = list(apply(dtm, 1, dtm.to.mat))

However,

     x[x != 0]

just won't work. The error says:

    $ operator is invalid for atomic vectors

I was wondering why this is the case. If I convert x to a matrix beforehand, the error goes away. However, my actual dtm has approximately 2,500,000 rows, and I fear that converting it would be very inefficient.

Answer 1

Me again!

I wouldn't use a dtm as the input for the stm package unless your data is particularly strange. Use the function stm::textProcessor instead. You can pass the documents as raw (unprocessed) text in a character vector of any length. You can also specify the metadata as you wish.

Suppose you have a dataframe df with a column called df$documents which is your raw text and df$meta which is your covariate:

    library(stm)

    # Clean the raw text and keep the metadata aligned with the documents
    processed <- textProcessor(df$documents, metadata = df$meta, lowercase = TRUE,
      removestopwords = TRUE, removenumbers = TRUE, removepunctuation = TRUE,
      stem = TRUE, wordLengths = c(3, Inf))

    # Fit the model with the covariate in the prevalence formula
    stm_50 <- stm(documents = processed$documents, vocab = processed$vocab,
      K = 50, prevalence = ~ meta, init.type = "Spectral", seed = 57468)

This will run a 50 topic STM.

textProcessor will deal with empty documents and their associated metadata.

Edit: stm::textProcessor is technically just a wrapper for the tm package. But it is designed to remove problem documents, while dealing with their associated covariates.

Also, the metadata argument can take a data frame if you have multiple covariates. In that case you would also need to modify the prevalence argument in the second call (the one to stm) so that the formula names those covariates.
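
As a rough illustration of the multiple-covariate case, here is a minimal base-R sketch of building the prevalence formula from a metadata data frame. The covariate names party and year are made up for the example, and the stm calls are shown only as comments because they assume the stm package is installed and that df$documents exists as in the example above.

```r
# Hypothetical covariates, for illustration only (not from the question).
meta <- data.frame(party = c("A", "B", "A"),
                   year  = c(2015, 2016, 2017))

# Build a one-sided formula from the column names: ~party + year
prevalence <- reformulate(names(meta))
prevalence

# With stm installed, the calls would then look roughly like this
# (an untested sketch, passing the processed metadata via `data`):
#   processed <- stm::textProcessor(df$documents, metadata = meta)
#   fit <- stm::stm(processed$documents, processed$vocab, K = 50,
#                   prevalence = prevalence, data = processed$meta)
```

Writing the formula by hand (prevalence = ~ party + year) is equivalent; reformulate is just convenient when there are many columns.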

Answer 2

If you have something tricky like this I'd switch over to the quanteda package, as it has nice converters to stm. If you want to stick with tm, have you tried using stm::readCorpus to change the object into the list structure stm needs? (stm::convertCorpus goes the other way, from stm's format to other formats.)
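
For intuition about what such a conversion does, here is a minimal base-R sketch that never builds a dense matrix. It assumes the DTM exposes the i/j/v triplet fields of slam's simple_triplet_matrix, the class tm's DocumentTermMatrix is built on. The names triplet_to_docs and toy are hypothetical, not part of tm, slam, or stm, and toy is a plain list standing in for a real DTM.

```r
# Hypothetical helper: turn a sparse triplet representation into the
# list-of-matrices layout shown in the question.
# Assumes `x` has fields `i` (document index), `j` (term index),
# `v` (count) and `nrow` (number of documents).
triplet_to_docs <- function(x) {
  keep <- x$v != 0                  # drop explicitly stored zeros, if any
  i <- x$i[keep]; j <- x$j[keep]; v <- x$v[keep]
  ord <- order(i, j)                # sort by document, then by term index
  i <- i[ord]; j <- j[ord]; v <- v[ord]
  # Group entry positions by document, keeping documents with no entries.
  by_doc <- split(seq_along(i), factor(i, levels = seq_len(x$nrow)))
  # One 2-row integer matrix per document:
  # row 1 = term indices, row 2 = counts.
  lapply(by_doc, function(idx) {
    matrix(as.integer(c(j[idx], v[idx])), nrow = 2, byrow = TRUE)
  })
}

# Toy stand-in for a 3-document, 5-term DTM (document 3 is empty).
toy <- list(i = c(1L, 1L, 2L, 2L, 2L),
            j = c(4L, 2L, 1L, 5L, 3L),
            v = c(1, 3, 2, 1, 4),
            nrow = 3L, ncol = 5L)

docs <- triplet_to_docs(toy)
docs[[1]]
```

Empty documents come out as matrices with zero columns; they would need to be dropped, together with their metadata rows, before fitting a model.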
