Using findAssocs to build a correlation matrix of all word combinations in R

I'm trying to write code that builds a table showing the correlations between all pairs of words in a corpus.

I know that I can use findAssocs in the tm package to find the correlations for a single word, i.e. findAssocs(dtm, "quick", 0.5) gives me every word whose correlation with "quick" is above 0.5, but I do not want to do this manually for each word in my text.

# Load the .csv file into R
library(tm)
file_loc <- "C:/temp/TESTER.csv"
x <- read.csv(file_loc, header = FALSE, stringsAsFactors = FALSE)
# One document per row of the first column. (DataframeSource in recent tm
# versions expects "doc_id" and "text" columns, which this file does not have.)
corp <- Corpus(VectorSource(x[[1]]))

# Clean up the text, then build the document-term matrix once
corp <- tm_map(corp, content_transformer(tolower))
corp <- tm_map(corp, removeNumbers)
corp <- tm_map(corp, removePunctuation)
corp <- tm_map(corp, content_transformer(stripWhitespace))
dtm <- DocumentTermMatrix(corp)

From here I can find the correlations for an individual word:

findAssocs(dtm, "quick", 0.4)

But I want to find all the correlations at once, like this:

       quick  easy   the   and 
quick   1.00  0.54  0.72  0.92     
 easy   0.54  1.00  0.98  0.54   
  the   0.72  0.98  1.00  0.05  
  and   0.92  0.54  0.05  1.00

Any suggestions?

Example of the "TESTER.csv" data file (starting from cell A1):

[1] I got my question answered very quickly
[2] It was quick and easy to find the information I needed
[3] My question was answered quickly by the people at stack overflow
[4] Because they're good at what they do
[5] They got it dealt with quickly and didn't mess around
[6] The information I needed was there all along
[7] They resolved it quite quickly

You can probably use as.matrix and cor. findAssocs has a lower limit of 0:

(cor_1 <- findAssocs(dtm, colnames(dtm)[1:2], 0))
#               all along
#  there       1.00  1.00
#  information 0.65  0.65
#  needed      0.65  0.65
#  the         0.47  0.47
#  was         0.47  0.47

cor gets you all the Pearson correlations, for what it's worth:

cor_2 <- cor(as.matrix(dtm))
cor_2[c("there", "information", "needed", "the", "was"), c("all", "along")]
#                   all     along
# there       1.0000000 1.0000000
# information 0.6454972 0.6454972
# needed      0.6454972 0.6454972
# the         0.4714045 0.4714045
# was         0.4714045 0.4714045
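
As a sanity check, two of the values above can be reproduced by hand in base R. This is only a sketch: the 0/1 vectors are typed in from the seven example sentences (they are not extracted from dtm), and the variable names are made up for illustration.

```r
# Per-sentence counts for three terms across the 7 example sentences
# (1 = the term occurs once in that sentence, 0 = it does not occur).
# These vectors are entered by hand for illustration.
all_count         <- c(0, 0, 0, 0, 0, 1, 0)  # "all": sentence 6 only
information_count <- c(0, 1, 0, 0, 0, 1, 0)  # "information": sentences 2 and 6
the_count         <- c(0, 1, 1, 0, 0, 1, 0)  # "the": sentences 2, 3 and 6

# Pearson correlation between the count vectors
cor(all_count, information_count)  # 0.6454972
cor(all_count, the_count)          # 0.4714045
```

These match the "information" and "the" rows in the "all" column, which suggests that cor on the dense matrix is computing the same quantity that findAssocs reports rounded to two decimals.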

The code that produced dtm above:

library(tm)

# The seven example sentences, one document each
x <- c("I got my question answered very quickly",
       "It was quick and easy to find the information I needed",
       "My question was answered quickly by the people at stack overflow",
       "Because they're good at what they do",
       "They got it dealt with quickly and didn't mess around",
       "The information I needed was there all along",
       "They resolved it quite quickly")

corp <- Corpus(VectorSource(x))
# Clean up the text before building the document-term matrix
corp <- tm_map(corp, content_transformer(tolower))
corp <- tm_map(corp, removeNumbers)
corp <- tm_map(corp, removePunctuation)
corp <- tm_map(corp, content_transformer(stripWhitespace))
dtm <- DocumentTermMatrix(corp)
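
Putting the pieces together, the table asked for in the question could be produced roughly as follows. This is a minimal sketch rather than a definitive implementation: it assumes the corpus is small enough for a dense matrix, it uses VCorpus instead of Corpus, and the names cor_all, words, idx and pairs are chosen here for illustration. A term with the same count in every document would have zero variance and yield NA from cor; that does not happen with this example data.

```r
library(tm)

# The seven example sentences, one document each
x <- c("I got my question answered very quickly",
       "It was quick and easy to find the information I needed",
       "My question was answered quickly by the people at stack overflow",
       "Because they're good at what they do",
       "They got it dealt with quickly and didn't mess around",
       "The information I needed was there all along",
       "They resolved it quite quickly")

corp <- VCorpus(VectorSource(x))
corp <- tm_map(corp, content_transformer(tolower))
corp <- tm_map(corp, removeNumbers)
corp <- tm_map(corp, removePunctuation)
corp <- tm_map(corp, stripWhitespace)
dtm <- DocumentTermMatrix(corp)

# Full term-by-term Pearson correlation matrix
cor_all <- cor(as.matrix(dtm))

# Sub-table for the words used in the question's example
words <- c("quick", "easy", "the", "and")
round(cor_all[words, words], 2)

# Every pair of distinct terms with correlation above 0.5,
# taken from the upper triangle so each pair appears once
idx <- which(upper.tri(cor_all) & cor_all > 0.5, arr.ind = TRUE)
pairs <- data.frame(term1 = rownames(cor_all)[idx[, "row"]],
                    term2 = colnames(cor_all)[idx[, "col"]],
                    cor   = cor_all[idx])
head(pairs)
```

The dense matrix grows with the square of the vocabulary size, so for a large corpus it may be worth trimming rare terms first (for example with removeSparseTerms) before calling cor.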
