How to incorporate features from a latent semantic analysis as independent variables in a predictive model

I am trying to run a logistic regression on text data in R. I have built a term-document matrix and a corresponding latent semantic space. As I understand it, LSA derives 'concepts' from 'terms', which can help with dimensionality reduction. Here's my code:

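    library(tm)   # TermDocumentMatrix, removeSparseTerms
    library(lsa)  # lsa, dimcalc_share

    # myngramtoken and myweight are the asker's own tokenizer and weighting function (not shown)
    tdm <- TermDocumentMatrix(corpus, control = list(tokenize = myngramtoken, weighting = myweight))
    tdm <- removeSparseTerms(tdm, 0.98)  # drop terms absent from more than 98% of documents
    tdm <- as.matrix(tdm)
    tdm.lsa <- lsa(tdm, dimcalc_share())
    tdm.lsa_tk <- as.data.frame(tdm.lsa$tk)  # term vectors
    tdm.lsa_dk <- as.data.frame(tdm.lsa$dk)  # document vectors
    tdm.lsa_sk <- as.data.frame(tdm.lsa$sk)  # singular values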

This gives features named V1, V2, V3, ..., V21. Is it possible to use these as the independent variables in my logistic regression? If so, how can I do it?

In the example above, the table tdm.lsa_dk is a matrix with the 'concepts' as columns and the documents as rows. It can serve as the new training and testing dataset for further analysis, in this case logistic regression. The dependent variable (from the original dataset) has to be added to the new dataset. The table tdm.lsa_sk can be used for variable selection: it holds the singular values, which rank the 'concept' variables in decreasing order of importance.
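
To make the roles of the dk and sk parts concrete, here is a minimal base-R sketch on a made-up toy matrix. It is only an illustration under stated assumptions: svd() stands in for lsa() (which is built on the same decomposition), the counts are invented, and the 0.5 cutoff is an arbitrary choice, not necessarily the exact rule applied by dimcalc_share().

```r
# Toy term-document matrix: 5 terms (rows) x 4 documents (columns).
# The counts are made up purely for illustration.
tdm <- matrix(
  c(2, 0, 1, 0,
    1, 0, 2, 0,
    0, 3, 0, 1,
    0, 1, 0, 2,
    1, 1, 1, 1),
  nrow = 5, byrow = TRUE,
  dimnames = list(paste0("term", 1:5), paste0("doc", 1:4))
)

# Plain singular value decomposition (stand-in for lsa()).
decomp <- svd(tdm)
sk <- decomp$d   # singular values, sorted in decreasing order
dk <- decomp$v   # one row per document, one column per concept
rownames(dk) <- colnames(tdm)

# Cumulative share of the singular values: a rough guide to how many
# concepts to keep. The 0.5 threshold is an assumption for this sketch.
share <- cumsum(sk) / sum(sk)
k <- which(share >= 0.5)[1]

# Keep the first k concept columns as predictors; as.data.frame()
# names the unnamed columns V1, V2, ...
features <- as.data.frame(dk[, seq_len(k), drop = FALSE])
print(features)
```

Each row of features corresponds to one document, so a response column can be attached row by row as long as it is in the same document order.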

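    library(caret)  # createDataPartition

    # the $dk part of the lsa object serves as the new dataset (one row per document)
    new.dataset <- tdm.lsa_dk
    # rows of original.dataset must be in the same order as the documents in the corpus
    new.dataset$y.var <- original.dataset$y.var

    # create training and testing datasets out of the new dataset (20% held out for testing)
    test_index <- createDataPartition(new.dataset$y.var, p = 0.2, list = FALSE)
    Test <- new.dataset[test_index, ]
    Train <- new.dataset[-test_index, ]

    # fit the logistic regression and predict probabilities for the test set
    model <- glm(y.var ~ ., data = Train, family = "binomial")
    prediction <- predict(model, Test, type = "response")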
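
Since predict() with type = "response" returns probabilities, a natural next step is to turn them into class labels and compare them with the held-out outcomes. The sketch below is one possible way to do that, not part of the original answer. It runs on simulated data so that it is self-contained: the columns V1 and V2, the coefficients, and the 0.5 threshold are all assumptions, and a plain sample() split replaces createDataPartition (so the split is not stratified).

```r
set.seed(42)
n <- 200

# Simulated stand-in for the LSA document features plus a binary outcome.
new.dataset <- data.frame(V1 = rnorm(n), V2 = rnorm(n))
new.dataset$y.var <- rbinom(n, 1, plogis(1.5 * new.dataset$V1 - new.dataset$V2))

# Simple random 20% hold-out using base R.
test_index <- sample(n, size = 0.2 * n)
Test  <- new.dataset[test_index, ]
Train <- new.dataset[-test_index, ]

# Same model call as in the answer.
model <- glm(y.var ~ ., data = Train, family = "binomial")
prediction <- predict(model, Test, type = "response")

# Convert probabilities to 0/1 labels with an assumed 0.5 cutoff.
predicted_class <- as.integer(prediction > 0.5)

# Cross-tabulate actual vs. predicted labels and compute accuracy.
confusion <- table(actual = Test$y.var, predicted = predicted_class)
accuracy <- mean(predicted_class == Test$y.var)

print(confusion)
print(accuracy)
```

The cutoff can be moved away from 0.5 if false positives and false negatives carry different costs.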

