Problem running stm for topic modelling with a single covariate

Question

I'm trying to run an LDA topic modelling analysis with stm, but I have a problem with my metadata. The pipeline seems to work fine, but one covariate (Age) is not being read, as shown in this example.

I have some tweets (the docu column in an Excel file) with an Age covariate that takes the values Young and Old.

Here is my data: http://www.mediafire.com/file/5eb9qe6gbg22o9i/dada.xlsx/file

library(stm)
library(readxl)
library(quanteda)
library(stringr)
library(tm)


data <-  read_xlsx("C:/dada.xlsx") 

# Remove t.co URLs
data$docu <- str_replace_all(data$docu, "https://t.co/[a-z,A-Z,0-9]*","")


data$docu <- gsub("@\\w+", " ", data$docu)  # Remove user names (all proper names if you're wise!)

data$docu <- iconv(data$docu, to = "ASCII", sub = " ")  # Convert to basic ASCII text to avoid silly characters
data$docu <- gsub("#\\w+", " ", data$docu)

data$docu <- gsub("http.+ |http.+$", " ", data$docu)  # Remove links

data$docu <- gsub("[[:punct:]]", " ", data$docu)  # Remove punctuation

data$docu<-  gsub("[\r\n]", "", data$docu)

data$docu <- tolower(data$docu)



# Remove stopwords. "SMART" refers to the English stopword list from the SMART information retrieval system.
data$docu <- tm::removeWords(x = data$docu, c(stopwords(kind = "SMART")))

data$docu <- gsub(" +", " ", data$docu) # Collapse runs of spaces into one

myCorpus <- corpus(data$docu)
docvars(myCorpus, "Age") <- as.factor(data$Age)


processed <- textProcessor(data$docu, metadata = data)

out <- prepDocuments(processed$documents, processed$vocab, processed$meta, lower.thresh = 2)

out$documents
out$meta
levels(out$meta)

First_STM <- stm(documents = out$documents, vocab = out$vocab,
                 K = 4, prevalence =~ Age ,
                 max.em.its = 25, data = out$meta,
                 init.type = "LDA", verbose = FALSE)

As the code shows, I tried to define Age as a factor. I suspect this isn't needed, because running textProcessor might be enough. Nevertheless, when I run levels(out$meta) I get NULL, and when I then run stm to get the actual topics I get a memory allocation error.

Answer 1

You set your metadata variable Age as a factor in this line:

docvars(myCorpus, "Age") <- as.factor(data$Age)

But you don't use myCorpus any further. In the next steps you use your data frame data for preprocessing, so the factor conversion never reaches stm. Try defining Age as a factor in the data frame itself:

data$Age <- factor(data$Age)

Do this just before the following lines:

processed <- textProcessor(data$docu, metadata = data)

out <- prepDocuments(processed$documents, processed$vocab, processed$meta, lower.thresh = 2)

You can then look at the levels like this:

levels(out$meta$Age)
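
To illustrate why levels(out$meta) returned NULL in the question, here is a minimal base R sketch. It does not use stm at all: the data frame meta is a made-up stand-in for out$meta, and its contents are assumptions for illustration only.

```r
# Toy stand-in for out$meta; the rows below are invented for illustration.
meta <- data.frame(
  docu = c("first tweet", "second tweet", "third tweet"),
  Age  = c("Young", "Old", "Young"),
  stringsAsFactors = FALSE
)

# levels() on a whole data frame is NULL: a data frame has no "levels" attribute.
whole_frame_levels <- levels(meta)

# A character column has no levels either.
char_levels <- levels(meta$Age)

# After conversion, the column carries its levels (sorted alphabetically by default).
meta$Age <- factor(meta$Age)
factor_levels <- levels(meta$Age)

print(whole_frame_levels)
print(char_levels)
print(factor_levels)
```

So a NULL from levels() on the whole metadata frame does not by itself mean the covariate is missing; the check has to be made on the Age column.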

I could not reproduce your memory allocation error, though. stm runs fine on my machine (Win 10 Pro, 8 GB RAM).

Answer 1 (2, accepted, 2019-03-07 19:42:54)