I am working on LSA (using R) for Document Similarity Analysis. Here are my steps
Imported the text data and created a corpus, then did basic corpus operations like stemming, white-space removal, etc.
Created the LSA space as below:

library(tm)
library(lsa)

tdm <- TermDocumentMatrix(chat_corpus)
tdm_matrix <- as.matrix(tdm)
# local binary term-frequency weighting * global IDF weighting
tdm.lsa <- lw_bintf(tdm_matrix) * gw_idf(tdm_matrix)
lsaSpace <- lsa(tdm.lsa)
Multidimensional Scaling (MDS) on the LSA space:
dist.mat.lsa <- dist(t(as.textmatrix(lsaSpace)))
fit <- cmdscale(dist.mat.lsa, eig = TRUE)
points <- data.frame(fit$points, row.names = chat$text)
I want to create a matrix/data frame showing how similar the texts are (as shown in the attached Result). Rows and columns will be the texts to match, while the cell values will be their similarity. Ideally the diagonal values will be 1 (a perfect match) while the rest of the cell values will be less than 1.
Please throw some insight into how to do this. Thanks in advance.
Note: I have the Python code for this but need the same in R:

similarity = np.asarray(np.asmatrix(dtm_lsa) * np.asmatrix(dtm_lsa).T)
pd.DataFrame(similarity, index=example, columns=example).head(10)
In order to do this you first need to take the S_k and D_k matrices from the LSA space you've created and multiply S_k by the transpose of D_k to get a k by n matrix, where k is the number of dimensions and n is the number of documents. The code would be as follows:
lsaMatrix <- diag(myLSAspace$sk) %*% t(myLSAspace$dk)
Then it's as simple as putting the resulting matrix through the cosine function from the lsa package:
simMatrix <- cosine(lsaMatrix)
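To match the labeled data frame the question asks for (rows and columns named by text, diagonal equal to 1), you can wrap the cosine output in a data frame. A minimal sketch; doc_labels here is a hypothetical stand-in for the asker's chat$text, and the toy matrix stands in for cosine(lsaMatrix):

```r
# Toy 3x3 similarity matrix standing in for cosine(lsaMatrix);
# in practice use the simMatrix computed above
simMatrix <- diag(3)
doc_labels <- paste("text", 1:3)  # stands in for chat$text

# Attach the document labels to both dimensions
simDF <- data.frame(simMatrix, row.names = doc_labels)
colnames(simDF) <- doc_labels
```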
which will result in an n by n similarity matrix that can then be used for clustering, etc.
You can read more about the S_k and D_k matrices in the lsa package documentation; they're outputs of the SVD that was applied.