findAssocs (tm) returns all correlations as a list of ones

I have a corpus created from 2 text documents and a DocumentTermMatrix, and I want to find correlations between words. Whatever words I choose, the findAssocs function returns a correlation of 1 for all words in the corpus. Why is that?

Here are excerpts from my code:

library(tm)
library(SnowballC)
doc <- Corpus(DirSource("C:/Users/biat/Documents/customersatis"))

toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))

doc <- tm_map(doc, toSpace, "-")
doc <- tm_map(doc, toSpace, ":")
doc <- tm_map(doc, removePunctuation)
doc <- tm_map(doc, content_transformer(tolower))
doc <- tm_map(doc, removeNumbers)
doc <- tm_map(doc, removeWords, stopwords("swedish"))
doc <- tm_map(doc, stripWhitespace)
doc <- tm_map(doc, PlainTextDocument)
doc <- tm_map(doc, stemDocument, "swedish")

dtm <- DocumentTermMatrix(doc)
findAssocs(dtm, "active", 0.1)

When I run this, the results imply that the term "active" has a correlation of 1 with all 560 other words, as shown below, which is not the case in reality.

$active
  admin  actions  all  analysis  arrends  ...  website  workshops
      1        1    1         1        1  ...        1          1

As scoa noted, you probably have two documents in which a term occurs in both, which results in correlations of 1.

Try collapsing the document before turning it into a corpus:
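To see why two documents lead to ones, here is a minimal base-R sketch. It assumes that findAssocs reports a Pearson correlation between term columns of the document-term matrix; the counts and the third document are invented purely for illustration and do not come from the question's data.

```r
# Toy term counts, invented for illustration: rows are documents, columns are terms.
two_docs <- matrix(
  c(3, 1, 5,
    1, 4, 2),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("doc1", "doc2"), c("active", "admin", "website"))
)

# Pearson correlation between term columns
# (assumed to match the measure that findAssocs reports).
cor_two <- cor(two_docs)
print(round(cor_two, 2))

# With a hypothetical third document, correlations are no longer forced to +1 or -1.
three_docs <- rbind(two_docs, doc3 = c(2, 2, 0))
cor_three <- cor(three_docs)
print(round(cor_three, 2))
```

With only two observations per term, any two terms whose counts both change between the documents correlate at exactly +1 or -1, and a threshold such as 0.1 keeps only the +1 cases. A term with the same count in both documents has zero variance, so its correlation is undefined (NA in base R).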

# Join with a space so that words at line boundaries are not merged together.
text <- paste(unlist(text), collapse = " ")
