Document-term matrix to list of matrices in R

Question

I have a document-term matrix dtm, for example:

    dtm
    <<DocumentTermMatrix (documents: 50, terms: 50)>>
    Non-/sparse entries: 220/2497
    Sparsity           : 100%
    Maximal term length: 7
    Weighting          : term frequency (tf)

Now I want to convert it into a list of matrices, one per document. This is to meet the input format required by the stm package:

    [[1]]
         [,1] [,2] [,3] [,4]
    [1,]  23   33   42   117
    [2,]   2    1    3     1

    [[2]]
         [,1] [,2] [,3] [,4]
    [1,]   2   19   93   168
    [2,]   2    2    1     1

I am thinking of finding all the non-zero entries in dtm and turning them into matrices, one row at a time:

    mat = matrix()
    dtm.to.mat = function(x){
        mat[1,] = x[x != 0]
        mat[2,] = colnames(x[x != 0])
        return(mat)
    }
    matrix = list(apply(dtm, 1, dtm.to.mat))

However,

     x[x != 0]

just won't work. The error says:

    $ operator is invalid for atomic vectors

I was wondering why this is the case. If I convert x to a matrix beforehand, I don't get this error. However, my actual dtm has approximately 2,500,000 rows, so I fear this will be very inefficient.

Answer 1

Me again!

I wouldn't use a dtm as the input for the stm package unless your data is particularly strange. Use the function stm::textProcessor. You can pass the documents as raw (unprocessed) text in a character vector of any length. You can also specify the metadata as you wish:

Suppose you have a data frame df with a column df$documents, which is your raw text, and df$meta, which is your covariate:

    processed <- textProcessor(df$documents, metadata = df$meta, lowercase = TRUE,
      removestopwords = TRUE, removenumbers = TRUE, removepunctuation = TRUE,
      stem = TRUE, wordLengths = c(3, Inf))

    stm_50 <- stm(documents = processed$documents, vocab = processed$vocab,
      K = 50, prevalence = ~ meta, init.type = "Spectral", seed = 57468)

This will run a 50-topic STM.

textProcessor will deal with empty documents and their associated metadata.

Edit: stm::textProcessor is technically just a wrapper for the tm package. But it is designed to remove problem documents while also dealing with their associated covariates.

Also, the metadata argument can take a data frame if you have multiple covariates. In that case you would also need to modify the prevalence argument in the second call (the call to stm).
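As a rough illustration of what modifying the prevalence argument could look like with several covariates, the sketch below builds the one-sided formula from the column names of a metadata data frame using base R only. The data frame `meta` and its columns `party` and `year` are hypothetical names made up for this example; they do not come from the question.

```r
# Hypothetical metadata with two covariates (names invented for illustration).
meta <- data.frame(party = c("a", "b", "a"), year = c(2015, 2016, 2017))

# Build a one-sided formula from the covariate names: ~party + year
prevalence_formula <- reformulate(names(meta))

# Sanity check: every variable used in the formula is a metadata column.
stopifnot(all(all.vars(prevalence_formula) %in% names(meta)))
```

The resulting formula would then be passed as the prevalence argument, presumably together with the processed metadata through stm's data argument so the covariates can be found.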

Answer 2

If you have something tricky like this I'd switch over to the quanteda package, as it has nice converters to stm. If you want to stick with tm, have you tried using stm::convertCorpus to change the object into the list structure stm needs?
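For reference, the list structure shown in the question (row 1: term indices, row 2: counts) can also be built by hand from a sparse triplet representation, without converting the whole dtm to a dense matrix. This is only a minimal sketch, not the converter either package uses. It assumes the dtm is stored as a triplet matrix whose row indices, column indices, and values are available as dtm$i, dtm$j, and dtm$v; the helper name triplets_to_stm_docs is hypothetical.

```r
# Convert sparse triplets (document index i, term index j, count v) into a
# list with one 2-row integer matrix per document:
#   row 1 = term indices, row 2 = term counts.
# Hypothetical helper for illustration; base R only.
triplets_to_stm_docs <- function(i, j, v, n_docs) {
  # Sort by document, then by term, so each matrix has ordered term indices.
  ord <- order(i, j)
  i <- i[ord]; j <- j[ord]; v <- v[ord]
  # Group entry positions by document; documents with no entries are kept
  # as empty groups so the list has exactly n_docs elements.
  by_doc <- split(seq_along(i), factor(i, levels = seq_len(n_docs)))
  unname(lapply(by_doc, function(idx) {
    rbind(as.integer(j[idx]), as.integer(v[idx]))
  }))
}

# Toy example: 2 documents, 3 non-zero entries.
docs <- triplets_to_stm_docs(i = c(2, 1, 1), j = c(3, 4, 2),
                             v = c(5, 1, 2), n_docs = 2)
```

With a real dtm the call would look something like triplets_to_stm_docs(dtm$i, dtm$j, dtm$v, nrow(dtm)), with colnames(dtm) serving as the vocabulary. Documents without any non-zero entry come back as 2-by-0 matrices and would still need to be dropped, along with their metadata, before fitting.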

Answer 1: 2017-12-23 17:23:58

Answer 2: 2017-12-29 13:55:29
