
How to incorporate features from a latent semantic analysis as independent variables in a predictive model

I am trying to run a logistic regression on text data in R. I have built a term-document matrix and the corresponding latent semantic space. In my understanding, LSA derives 'concepts' from 'terms', which helps with dimension reduction. Here is my code:

library(tm)    # TermDocumentMatrix, removeSparseTerms
library(lsa)   # lsa, dimcalc_share

tdm = TermDocumentMatrix(corpus, control = list(tokenize = myngramtoken, weighting = myweight))
tdm = removeSparseTerms(tdm, 0.98)
tdm = as.matrix(tdm)

tdm.lsa = lsa(tdm, dims = dimcalc_share())
tdm.lsa_tk = as.data.frame(tdm.lsa$tk)   # term vectors
tdm.lsa_dk = as.data.frame(tdm.lsa$dk)   # document vectors
tdm.lsa_sk = as.data.frame(tdm.lsa$sk)   # singular values

This gives features named V1, V2, ..., V21. Is it possible to use these as independent variables in my logistic regression? If so, how can I do it?

In the example above, the table tdm.lsa_dk is a matrix with the 'concepts' as columns and the documents as rows. It can serve as the new training and testing dataset for further analysis, in this case logistic regression. The dependent variable (the outcome y.var from the original dataset) has to be added to this new dataset. The table tdm.lsa_sk contains the singular values and can be used for variable selection: it ranks the 'concept' dimensions in decreasing order of importance (a sketch is given after the code below).

     # the $dk part of the lsa output behaves as your new dataset

    library(caret)   # for createDataPartition

    new.dataset <- tdm.lsa_dk
    new.dataset$y.var <- original.dataset$y.var   # attach the outcome variable

     # create training and testing datasets from the new dataset

    test_index <- createDataPartition(new.dataset$y.var, p = .2, list = FALSE)
    Test <- new.dataset[test_index, ]
    Train <- new.dataset[-test_index, ]

     # fit the logistic regression model and predict on the test set

    model <- glm(y.var ~ ., data = Train, family = "binomial")
    prediction <- predict(model, Test, type = "response")
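
The code above keeps every concept column. For the variable selection mentioned earlier, one possible approach is to use the singular values in tdm.lsa$sk to decide how many leading concept columns to retain. The following is only a minimal sketch, assuming the objects created above; the 90% cutoff and the use of squared singular values as a variance proxy are my own assumptions, not part of the original answer.

     # keep only the leading 'concept' columns, chosen from the singular values

    sk <- tdm.lsa$sk                        # singular values, largest first
    share <- cumsum(sk^2) / sum(sk^2)       # cumulative share of variance
    k <- which(share >= 0.9)[1]             # smallest k covering 90% (assumed cutoff)
    reduced.dataset <- tdm.lsa_dk[, 1:k, drop = FALSE]
    reduced.dataset$y.var <- original.dataset$y.var

The reduced dataset can then be split into training and test sets and passed to glm() exactly as shown above.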
