Using findAssocs to build a correlation matrix of all word combinations in R

Question

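I am trying to write code that builds a table showing all the correlations between all the words in a corpus.

I know I can use findAssocs in the tm package to find all word correlations for a single word, i.e. findAssocs(dtm, "quick", 0.5) would give me every word whose correlation with the word "quick" is at least 0.5, but I don't want to do this manually for each word in the text I have.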

# Load a one-column .csv file into R
library(tm)
file_loc <- "C:/temp/TESTER.csv"
x <- read.csv(file_loc, header = FALSE, stringsAsFactors = FALSE)

# Build the corpus from the first column.
# (Recent tm versions expect DataframeSource input to have doc_id and text
# columns, so VectorSource is used here instead.)
corp <- Corpus(VectorSource(x[[1]]))

# Clean up the text
corp <- tm_map(corp, content_transformer(tolower))
corp <- tm_map(corp, removeNumbers)
corp <- tm_map(corp, removePunctuation)
corp <- tm_map(corp, stripWhitespace)

# Build the document-term matrix once, after cleaning
dtm <- DocumentTermMatrix(corp)

From here I can find the word correlations for individual words:

findAssocs(dtm, "quick", 0.4)

But I want to find all the correlations, like this:

       quick  easy   the   and
quick   1.00  0.54  0.72  0.92
easy    0.54  1.00  0.98  0.54
the     0.72  0.98  1.00  0.05
and     0.92  0.54  0.05  1.00

Any suggestions?

Example of the data file "TESTER.csv" (starting from cell A1):

[1] I got my question answered very quickly
[2] It was quick and easy to find the information I needed
[3] My question was answered quickly by the people at stack overflow
[4] Because they're good at what they do
[5] They got it dealt with quickly and didn't mess around
[6] The information I needed was there all along
[7] They resolved it quite quickly

Answer 1

You can probably use as.matrix and cor. Here is findAssocs with a lower limit of 0:

(cor_1 <- findAssocs(dtm, colnames(dtm)[1:2], 0))
#               all along
#  there       1.00  1.00
#  information 0.65  0.65
#  needed      0.65  0.65
#  the         0.47  0.47
#  was         0.47  0.47

cor gets you all the Pearson correlations, for what it's worth:

cor_2 <- cor(as.matrix(dtm))
cor_2[c("there", "information", "needed", "the", "was"), c("all", "along")]
#                   all     along
# there       1.0000000 1.0000000
# information 0.6454972 0.6454972
# needed      0.6454972 0.6454972
# the         0.4714045 0.4714045
# was         0.4714045 0.4714045
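
As a cross-check that does not need tm, the same Pearson correlations can be sketched in base R by building the document-term counts by hand. This is only a rough stand-in for the tm pipeline: the tokenisation below (lower-casing, stripping punctuation and digits, splitting on whitespace) is an assumption, and it keeps short words that DocumentTermMatrix drops by default. The names docs, vocab, counts and cor_all are made up for this example.

```r
# The seven sample sentences from the question
docs <- c(
  "I got my question answered very quickly",
  "It was quick and easy to find the information I needed",
  "My question was answered quickly by the people at stack overflow",
  "Because they're good at what they do",
  "They got it dealt with quickly and didn't mess around",
  "The information I needed was there all along",
  "They resolved it quite quickly"
)

# Lower-case, drop punctuation and digits, split on whitespace
clean  <- gsub("[[:punct:][:digit:]]", "", tolower(docs))
tokens <- strsplit(trimws(clean), "\\s+")
vocab  <- sort(unique(unlist(tokens)))

# Rows are documents, columns are terms, cells are term counts
counts <- t(vapply(
  tokens,
  function(tk) tabulate(match(tk, vocab), nbins = length(vocab)),
  integer(length(vocab))
))
colnames(counts) <- vocab

# Full term-by-term Pearson correlation matrix
cor_all <- cor(counts)

# Pick out any sub-table, e.g. the layout asked for in the question
wanted <- c("quick", "easy", "the", "and")
print(round(cor_all[wanted, wanted], 2))
```

Indexing cor_all by term names gives whatever sub-table is needed, and rounding to two decimals gives output in the same style as the findAssocs result shown above.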

The code preceding the above:

x <- readLines(n = 7)
[1] I got my question answered very quickly
[2] It was quick and easy to find the information I needed
[3] My question was answered quickly by the people at stack overflow
[4] Because they're good at what they do
[5] They got it dealt with quickly and didn't mess around
[6] The information I needed was there all along
[7] They resolved it quite quickly
library(tm)
corp <- Corpus(VectorSource(x))
# Clean the text first, then build the document-term matrix once
corp <- tm_map(corp, content_transformer(tolower))
corp <- tm_map(corp, removeNumbers)
corp <- tm_map(corp, removePunctuation)
corp <- tm_map(corp, stripWhitespace)
dtm <- DocumentTermMatrix(corp)
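
If the goal is a findAssocs-style list of associations for every word at once, one possible sketch is to filter a full correlation matrix by a threshold. The helper assoc_pairs below is hypothetical (it is not part of tm), and the toy matrix simply reuses the illustrative numbers from the table in the question. With real data you would pass in the result of cor(as.matrix(dtm)) instead.

```r
# Hypothetical helper: list term pairs whose correlation is at or above `limit`
assoc_pairs <- function(cor_mat, limit = 0.5) {
  # Look only at the upper triangle so each unordered pair appears once,
  # and skip NA entries (cor() yields NA for constant columns)
  keep <- upper.tri(cor_mat) & !is.na(cor_mat) & cor_mat >= limit
  idx  <- which(keep, arr.ind = TRUE)
  out  <- data.frame(
    term1 = rownames(cor_mat)[idx[, "row"]],
    term2 = colnames(cor_mat)[idx[, "col"]],
    cor   = cor_mat[idx],
    stringsAsFactors = FALSE
  )
  # Strongest associations first
  out[order(-out$cor), , drop = FALSE]
}

# Toy matrix using the illustrative values from the question's table
terms <- c("quick", "easy", "the", "and")
m <- matrix(c(1.00, 0.54, 0.72, 0.92,
              0.54, 1.00, 0.98, 0.54,
              0.72, 0.98, 1.00, 0.05,
              0.92, 0.54, 0.05, 1.00),
            nrow = 4, byrow = TRUE, dimnames = list(terms, terms))

strong <- assoc_pairs(m, limit = 0.7)
print(strong)
```

Note that as.matrix(dtm) creates a dense documents-by-terms matrix, so for a large vocabulary this approach can be memory-hungry. Restricting the matrix to the terms of interest before calling cor is one way to keep it manageable.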

Solution 1
4 2015-05-22 07:04:18
