Document-term matrix to a list of matrices in R

I have a document-term matrix dtm, for example:

    dtm
    <<DocumentTermMatrix (documents: 50, terms: 50)>>
    Non-/sparse entries: 220/2497
    Sparsity           : 100%
    Maximal term length: 7
    Weighting          : term frequency (tf)

Now I want to convert it into a list of matrices, each representing one document. This is the input format required by the stm package:

    [[1]]
         [,1] [,2] [,3] [,4]
    [1,]  23   33   42   117
    [2,]   2    1    3     1

    [[2]]
         [,1] [,2] [,3] [,4]
    [1,]   2   19   93   168
    [2,]   2    2    1     1

I am thinking of finding all the non-zero entries in dtm and turning them into matrices, one row at a time:

    mat = matrix()
    dtm.to.mat = function(x){
        mat[1,] = x[x != 0]
        mat[2,] = colnames(x[x != 0])
        return(mat)
    }
    matrix = list(apply(dtm, 1, dtm.to.mat))

However,

     x[x != 0]

just won't work. The error says:

    $ operator is invalid for atomic vectors

I was wondering why this is the case. If I convert x to a matrix beforehand, I don't get this error. However, my actual dtm has approximately 2,500,000 rows, and I fear the conversion would be very inefficient.

Me again!

I wouldn't use a dtm as the input for the stm package unless your data is particularly strange. Use the function stm::textProcessor. You can pass the documents as raw (unprocessed) text in a character vector of any length. You can also specify the metadata as you wish:

Suppose you have a data frame df with a column df$documents holding your raw text and a column df$meta holding your covariate:

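    library(stm)

    # textProcessor expects metadata as a data frame, so keep meta as a one-column data frame
    processed <- textProcessor(df$documents, metadata = df["meta"], lowercase = TRUE,
      removestopwords = TRUE, removenumbers = TRUE, removepunctuation = TRUE,
      stem = TRUE, wordLengths = c(3, Inf))

    # data = processed$meta lets the prevalence formula find the covariate
    stm_50 <- stm(documents = processed$documents, vocab = processed$vocab,
      K = 50, prevalence = ~ meta, data = processed$meta,
      init.type = "Spectral", seed = 57468)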

This will run a 50-topic STM.

textProcessor will deal with empty documents and their associated metadata.

Edit: stm::textProcessor is technically just a wrapper around the tm package, but it is designed to remove problem documents while keeping their associated covariates in step.
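To make the idea concrete, here is a minimal base R sketch of dropping empty documents together with their covariate rows. It is not how textProcessor is implemented; the toy `docs` list, the `meta` data frame, and the `treated` column are all made up for illustration.

```r
# Toy documents in the stm layout: row 1 = term index, row 2 = count.
# The second document is empty (a 2 x 0 matrix).
docs <- list(
  matrix(c(1L, 3L, 2L, 1L), nrow = 2, byrow = TRUE),
  matrix(integer(0), nrow = 2),
  matrix(c(2L, 5L), nrow = 2, byrow = TRUE)
)

# Hypothetical covariate, one row per document
meta <- data.frame(treated = c(1, 1, 0))

# Keep only documents that contain at least one term
keep <- vapply(docs, ncol, integer(1)) > 0

# Apply the same filter to documents and metadata so they stay aligned
docs_kept <- docs[keep]
meta_kept <- meta[keep, , drop = FALSE]
```

The point is only that the same logical filter is applied to both objects, so row k of the metadata still describes document k afterwards.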

Also, the metadata argument can take a data frame if you have multiple covariates. In that case you would also need to modify the prevalence argument in the second call (the one to stm).
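As a rough sketch of the multiple-covariate case, the snippet below builds a metadata data frame and a matching prevalence formula, then checks that every variable in the formula is present in the metadata. The columns `party` and `year` are hypothetical, and the stm calls are left as comments because they need the stm package and real text.

```r
# Hypothetical data frame: raw text plus two covariates
df <- data.frame(
  documents = c("first text", "second text", "third text"),
  party = c("a", "b", "a"),
  year = c(2001, 2002, 2003),
  stringsAsFactors = FALSE
)

# Metadata as a data frame with one column per covariate
meta <- df[, c("party", "year"), drop = FALSE]

# The prevalence formula must name columns that exist in the metadata
prevalence <- ~ party + year
missing_vars <- setdiff(all.vars(prevalence), colnames(meta))
if (length(missing_vars) > 0) {
  stop("Covariates not found in metadata: ", paste(missing_vars, collapse = ", "))
}

# With stm installed, the calls would then look roughly like this:
# processed <- stm::textProcessor(df$documents, metadata = meta)
# fit <- stm::stm(processed$documents, processed$vocab, K = 50,
#                 prevalence = prevalence, data = processed$meta)
```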

If you have something tricky like this, I'd switch over to the quanteda package, as it has nice converters to stm. If you want to stick with tm, have you tried using stm::convertCorpus to change the object into the list structure stm needs?
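For reference, the list structure itself can be built by hand without converting to a dense matrix. The sketch below is base R only and works on sparse triplets (document index, term index, count), following the layout shown in the question: term indices in row 1, counts in row 2. The helper name `triplets_to_stm_docs` is made up for this example; it is not part of stm or tm.

```r
# Hypothetical helper: build stm-style documents from sparse triplets.
# i = document (row) index, j = term (column) index, v = count.
triplets_to_stm_docs <- function(i, j, v, n_docs) {
  # Drop explicit zeros, if any
  keep <- v != 0
  i <- i[keep]; j <- j[keep]; v <- v[keep]

  # Sort by document, then by term index
  ord <- order(i, j)
  i <- i[ord]; j <- j[ord]; v <- v[ord]

  # Group entry positions by document; the factor levels keep
  # documents with no entries in the result as empty groups
  groups <- split(seq_along(i), factor(i, levels = seq_len(n_docs)))

  # One 2-row integer matrix per document: indices on top, counts below
  unname(lapply(groups, function(idx) {
    matrix(c(as.integer(j[idx]), as.integer(v[idx])), nrow = 2, byrow = TRUE)
  }))
}

# Toy data: 3 documents, 5 terms; document 2 has no terms
i <- c(1, 1, 3, 3, 3)
j <- c(2, 4, 1, 2, 5)
v <- c(2, 1, 1, 3, 1)
docs <- triplets_to_stm_docs(i, j, v, n_docs = 3)
```

A tm DocumentTermMatrix is stored as a slam simple triplet matrix, so assuming that layout, a call such as `triplets_to_stm_docs(dtm$i, dtm$j, dtm$v, nrow(dtm))` with `colnames(dtm)` as the vocabulary should give the same structure while only touching the non-zero entries. Empty documents would still need to be removed, along with their metadata rows, before fitting a model.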
