findAssocs (tm) returns all correlations as a list of ones
I have a corpus created from two text documents, and a DocumentTermMatrix built from it in which I want to find correlations between words. Whichever words I select, the findAssocs function returns a correlation of 1 for every word in the corpus. Why is that?
Here are excerpts from my code:
library(tm)
library(SnowballC)
doc <- Corpus(DirSource("C:/Users/biat/Documents/customersatis"))
toSpace <- content_transformer(function(x,pattern) {return (gsub(pattern, " ", x))})
doc <- tm_map(doc, toSpace, "-")
doc <- tm_map(doc, toSpace, ":")
doc <- tm_map(doc, removePunctuation)
doc <- tm_map(doc,content_transformer(tolower))
doc <- tm_map(doc,removeNumbers)
doc <- tm_map(doc,removeWords,stopwords("swedish"))
doc <- tm_map(doc,stripWhitespace)
doc <- tm_map(doc, PlainTextDocument)
doc <- tm_map(doc, stemDocument, "swedish")
dtm <- DocumentTermMatrix(doc)
findAssocs(dtm,"active",0.1)
When I run this, the result implies that the term "active" has a correlation of 1 with all 560 other words, which in reality it does not:
$active
    admin   actions       all  analysis   arrends
        1         1         1         1         1
  ...
  website workshops
        1         1
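As scoa noted, your corpus probably contains only two documents. With just two observations per term, any two terms whose counts change in the same direction from one document to the other come out perfectly correlated, which results in ones.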
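The effect of having only two documents can be reproduced in base R without tm. The sketch below uses a made-up count matrix (the numbers and the choice of terms are assumptions for illustration, not the asker's data) and computes the Pearson correlation of "active" with the other terms, which is the kind of measure findAssocs reports.

```r
# Toy term counts for a corpus of exactly two documents (made-up numbers).
# Rows are documents and columns are terms, as in a DocumentTermMatrix.
counts <- matrix(
  c(3, 1, 5, 2,   # document 1
    1, 0, 9, 2),  # document 2
  nrow = 2, byrow = TRUE,
  dimnames = list(c("doc1", "doc2"),
                  c("active", "admin", "website", "all"))
)

# Pearson correlation of "active" with every other term.
# cor() warns about the constant column "all" (zero standard deviation)
# and yields NA for it, so the warning is suppressed here.
others <- counts[, colnames(counts) != "active"]
assoc <- suppressWarnings(cor(counts[, "active"], others))
print(assoc)
```

With two data points per term, every defined correlation is either 1 or -1: "admin" falls together with "active" and gets 1, "website" moves the opposite way and gets -1, and "all" does not vary, so its correlation is undefined (NA). Since findAssocs only reports values at or above the given lower limit, the -1 and NA entries would be dropped and only the ones would remain, which matches the output in the question.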
Try collapsing the document before turning it into a corpus:
text <- paste(unlist(text), collapse ="")