
Converting a document-term matrix to a list of matrices in R

I have a document-term matrix dtm, for example:

    dtm
    <<DocumentTermMatrix (documents: 50, terms: 50)>>
    Non-/sparse entries: 220/2497
    Sparsity           : 100%
    Maximal term length: 7
    Weighting          : term frequency (tf)

Now I want to convert it into a list of matrices, one per document. This is the input format required by the stm package:

    [[1]]
         [,1] [,2] [,3] [,4]
    [1,]  23   33   42   117
    [2,]   2    1    3     1

    [[2]]
         [,1] [,2] [,3] [,4]
    [1,]   2   19   93   168
    [2,]   2    2    1     1

My idea is to find all the non-zero entries of dtm and build the matrices from them, one row (document) at a time:

    mat = matrix()
    dtm.to.mat = function(x){
        mat[1,] = x[x != 0]
        mat[2,] = colnames(x[x != 0])
        return(mat)
    }
    matrix = list(apply(dtm, 1, dtm.to.mat))

However,

     x[x != 0]

just won't work. The error says:

    $ operator is invalid for atomic vectors

I was wondering why this is the case. If I convert x to a matrix beforehand, the error goes away. However, my actual dtm has approximately 2,500,000 rows, so I fear that would be very inefficient.

Me again!

I wouldn't use a dtm as the input for the stm package unless your data is particularly strange. Use the function stm::textProcessor instead. Its documents argument can be raw (unprocessed) text in a character vector of any length. You can also specify the metadata as you wish.

Suppose you have a dataframe df with a column called df$documents which is your raw text and df$meta which is your covariate:

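    library(stm)

    # metadata is documented as a data.frame or matrix, so pass a one-column data frame
    processed <- textProcessor(df$documents, metadata = df["meta"], lowercase = TRUE,
      removestopwords = TRUE, removenumbers = TRUE, removepunctuation = TRUE,
      stem = TRUE, wordLengths = c(3, Inf))

    # data = processed$meta lets the prevalence formula find the covariate
    stm_50 <- stm(documents = processed$documents, vocab = processed$vocab,
      K = 50, prevalence = ~ meta, data = processed$meta,
      init.type = "Spectral", seed = 57468)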

This will run a 50-topic STM.

textProcessor will deal with empty documents and their associated metadata.

Edit: stm::textProcessor is technically just a wrapper for the tm package. But it is designed to remove problem documents, while dealing with their associated covariates.

Also, the metadata argument can take a data frame with several columns if you have multiple covariates. In that case you would also need to modify the prevalence formula in the second call (the stm() call).
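As a rough illustration of the multi-covariate case, the sketch below builds a metadata data frame and a matching prevalence formula. The column names party and year and the object names meta_df and prevalence_formula are made up for the example. The stm calls are left as comments so the snippet runs without the package installed.

```r
# Hypothetical metadata with two covariates, one row per document
meta_df <- data.frame(
  party = c("a", "b", "a"),
  year  = c(2001, 2002, 2003)
)

# One-sided formula naming both covariates
prevalence_formula <- ~ party + year

# Every variable in the formula should be a column of the metadata
formula_vars <- all.vars(prevalence_formula)
missing_vars <- setdiff(formula_vars, names(meta_df))

# With stm installed, the calls would then look roughly like this:
# processed <- textProcessor(df$documents, metadata = meta_df)
# fit <- stm(processed$documents, processed$vocab, K = 50,
#            prevalence = prevalence_formula, data = processed$meta)
```

Checking missing_vars before fitting is just a cheap way to catch a misspelled covariate name early.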

If you have something tricky like this, I'd switch over to the quanteda package, as it has nice converters to stm. If you want to stick with tm, have you tried using stm::convertCorpus to change the object into the list structure stm needs?
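For reference, the list structure itself can be built from the sparse triplet representation without ever creating a dense matrix. The sketch below is only an illustration in base R. triplet_to_docs is a hypothetical helper, not part of tm or stm. It assumes the non-zero entries are available as three parallel vectors: i (document index), j (term index) and v (count). A tm DocumentTermMatrix stores these as dtm$i, dtm$j and dtm$v.

```r
# Hypothetical helper (not part of tm or stm): build one 2-row matrix per
# document from triplet vectors. Row 1 holds term indices, row 2 holds counts.
triplet_to_docs <- function(i, j, v, n_docs) {
  # Drop any explicit zeros
  keep <- v > 0
  i <- i[keep]; j <- j[keep]; v <- v[keep]

  # Sort by document, then by term index within each document
  ord <- order(i, j)
  i <- i[ord]; j <- j[ord]; v <- v[ord]

  # Group entry positions by document; the factor levels keep empty
  # documents in the result as 2 x 0 matrices
  groups <- split(seq_along(i), factor(i, levels = seq_len(n_docs)))

  unname(lapply(groups, function(idx) {
    matrix(as.integer(c(j[idx], v[idx])), nrow = 2, byrow = TRUE)
  }))
}

# Toy data: 3 documents, 4 terms; document 2 has no terms
i <- c(1, 1, 3, 3, 3)
j <- c(2, 4, 1, 2, 3)
v <- c(5, 1, 2, 2, 7)

docs <- triplet_to_docs(i, j, v, n_docs = 3)
docs[[1]]
```

With a real dtm, the call would presumably be triplet_to_docs(dtm$i, dtm$j, dtm$v, nrow(dtm)), with colnames(dtm) serving as the vocabulary. Empty documents come back as 2 x 0 matrices and may need to be dropped, together with their metadata rows, before fitting a model.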

