
Document similarity using LSA in R

I am working on LSA (using R) for document similarity analysis. Here are my steps:

  1. Imported the text data and created a corpus. Did basic corpus operations like stemming, whitespace removal, etc.

  2. Created the LSA space as below:

    tdm <- TermDocumentMatrix(chat_corpus)
    tdm_matrix <- as.matrix(tdm)
    tdm.lsa <- lw_bintf(tdm_matrix) * gw_idf(tdm_matrix)
    lsaSpace <- lsa(tdm.lsa)

  3. Multidimensional scaling (MDS) on the LSA space:

dist.mat.lsa <- dist(t(as.textmatrix(lsaSpace)))
fit <- cmdscale(dist.mat.lsa, eig = TRUE)
points <- data.frame(fit$points, row.names = chat$text)

I want to create a matrix/data frame showing how similar the texts are (as shown in the attachment Result). The rows and columns will be the texts to match, and the cell values will be their similarity. Ideally the diagonal values will be 1 (perfect match) while the rest of the cell values will be less than 1.

Please share some insights into how to do this. Thanks in advance.

Note: I have Python code for this but need the same in R:

import numpy as np
import pandas as pd

similarity = np.asarray(np.asmatrix(dtm_lsa) * np.asmatrix(dtm_lsa).T)
pd.DataFrame(similarity, index=example, columns=example).head(10)

Expected Result

To do this, first take the S_k and D_k matrices from the LSA space you've created and multiply S_k by the transpose of D_k. This gives a k by n matrix, where k is the number of dimensions and n is the number of documents. Here myLSAspace is the object returned by lsa() (lsaSpace in the question), and diag() turns the vector of singular values into a diagonal matrix:

lsaMatrix <- diag(myLSAspace$sk) %*% t(myLSAspace$dk)

Then it's as simple as putting the resulting matrix through the cosine function from the lsa package:

simMatrix <- cosine(lsaMatrix)

This yields an n by n similarity matrix (one row and one column per document), which can then be used for clustering etc.

You can read more about the S_k and D_k matrices in the lsa package documentation; they are outputs of the SVD that lsa() applies.
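To see what the two steps do, here is a minimal sketch in base R. It does not use the lsa package: svd() stands in for lsa(), and the cosine is computed by hand in place of cosine(). The toy term-document matrix, the document names, and k = 2 are all made up for illustration. With a real LSA space, `$sk` and `$dk` play the roles of `sk` and `dk` below.

```r
# Toy term-document matrix: rows are terms, columns are documents.
# Counts, term names and document names are invented for illustration.
tdm_matrix <- matrix(
  c(2, 0, 1, 0,
    1, 1, 0, 0,
    0, 2, 0, 1,
    0, 0, 3, 1,
    1, 0, 0, 2),
  nrow = 5, byrow = TRUE,
  dimnames = list(
    c("price", "order", "refund", "login", "reset"),
    c("doc1", "doc2", "doc3", "doc4")
  )
)

k <- 2                             # latent dimensions to keep (assumed value)
dec <- svd(tdm_matrix)             # base-R SVD, standing in for lsa()
sk <- dec$d[1:k]                   # top-k singular values (S_k)
dk <- dec$v[, 1:k, drop = FALSE]   # document vectors (D_k), n x k

# k x n matrix of document coordinates in the latent space
lsaMatrix <- diag(sk, nrow = k) %*% t(dk)
colnames(lsaMatrix) <- colnames(tdm_matrix)

# Cosine similarity between columns: scale each column to unit length,
# then take all pairwise dot products.
unit <- sweep(lsaMatrix, 2, sqrt(colSums(lsaMatrix^2)), "/")
simMatrix <- crossprod(unit)       # n x n, symmetric, 1 on the diagonal

# Data frame with documents as both row and column labels
simDF <- as.data.frame(round(simMatrix, 3))
print(simDF)
```

To label the rows and columns with the original texts instead of document IDs, you could assign them with dimnames(simMatrix) before converting to a data frame, provided the texts are unique.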

https://cran.r-project.org/web/packages/lsa/lsa.pdf
