Using findAssocs to build a correlation matrix of all word combinations in R
I'm trying to write code that builds a table showing the correlations between all the words in a corpus.
I know that I can use findAssocs in the tm package to find all word correlations for a single word, i.e. findAssocs(dtm, "quick", 0.5) would give me all the words that have a correlation with the word "quick" above 0.5, but I do not want to do this manually for each word in the text I have.
# Load a .csv file into R
file_loc <- "C:/temp/TESTER.csv"
x <- read.csv(file_loc, header = FALSE)
library(tm)
corp <- Corpus(DataframeSource(x))
# Clean up the text
corp <- tm_map(corp, content_transformer(tolower))
corp <- tm_map(corp, removeNumbers)
corp <- tm_map(corp, removePunctuation)
corp <- tm_map(corp, content_transformer(stripWhitespace))
# Build the document-term matrix from the cleaned corpus
dtm <- DocumentTermMatrix(corp)
From here I can find the word correlations for individual words:
findAssocs(dtm, "quick", 0.4)
But I want to find all the correlations, like this:
      quick easy  the  and
quick  1.00 0.54 0.72 0.92
easy   0.54 1.00 0.98 0.54
the    0.72 0.98 1.00 0.05
and    0.92 0.54 0.05 1.00
Any suggestions?
Example of the "TESTER.csv" data file (starting from cell A1):
[1] I got my question answered very quickly
[2] It was quick and easy to find the information I needed
[3] My question was answered quickly by the people at stack overflow
[4] Because they're good at what they do
[5] They got it dealt with quickly and didn't mess around
[6] The information I needed was there all along
[7] They resolved it quite quickly
You can probably use as.matrix and cor. findAssocs has a lower limit of 0:
(cor_1 <- findAssocs(dtm, colnames(dtm)[1:2], 0))
#              all along
# there       1.00  1.00
# information 0.65  0.65
# needed      0.65  0.65
# the         0.47  0.47
# was         0.47  0.47
cor gets you all Pearson correlations, for what it's worth:
cor_2 <- cor(as.matrix(dtm))
cor_2[c("there", "information", "needed", "the", "was"), c("all", "along")]
#                   all     along
# there       1.0000000 1.0000000
# information 0.6454972 0.6454972
# needed      0.6454972 0.6454972
# the         0.4714045 0.4714045
# was         0.4714045 0.4714045
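To make the cor approach concrete without needing tm, here is a minimal base R sketch. The matrix m is a hand-built stand-in for as.matrix(dtm), holding 0/1 counts for four of the terms across the seven example sentences, and assoc_matrix() is a hypothetical helper (not part of tm) that blanks out correlations below a cutoff, loosely mirroring the corlimit argument of findAssocs.

```r
# Hypothetical stand-in for as.matrix(dtm): rows are the 7 example documents,
# columns are 0/1 counts for four of the terms.
m <- cbind(
  all         = c(0, 0, 0, 0, 0, 1, 0),
  there       = c(0, 0, 0, 0, 0, 1, 0),
  information = c(0, 1, 0, 0, 0, 1, 0),
  the         = c(0, 1, 1, 0, 0, 1, 0)
)

# Full term-by-term Pearson correlation matrix in one call.
cor_all <- cor(m)

# Hypothetical helper (not a tm function): round to two decimals and
# replace every correlation below corlimit with NA.
assoc_matrix <- function(mat, corlimit = 0) {
  r <- round(cor(mat), 2)
  r[r < corlimit] <- NA
  r
}

print(round(cor_all, 2))
print(assoc_matrix(m, corlimit = 0.5))
```

With the real data, passing as.matrix(dtm) in place of m should give the full table for every term; note that the dense matrix grows with the square of the vocabulary size, so this is best suited to small corpora.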
The setup code that produces the dtm used above:
x <- readLines(n = 7)
[1] I got my question answered very quickly
[2] It was quick and easy to find the information I needed
[3] My question was answered quickly by the people at stack overflow
[4] Because they're good at what they do
[5] They got it dealt with quickly and didn't mess around
[6] The information I needed was there all along
[7] They resolved it quite quickly
library(tm)
corp <- Corpus(VectorSource(x))
dtm <- DocumentTermMatrix(corp)
corp <- tm_map(corp, content_transformer(tolower))
corp <- tm_map(corp, removeNumbers)
corp <- tm_map(corp, removePunctuation)
corp <- tm_map(corp, content_transformer(stripWhitespace))
dtm <- DocumentTermMatrix(corp)
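Once a full correlation matrix exists, it can also be flattened into a list of word pairs above a cutoff, which is closer to what findAssocs reports per word. The sketch below is only an illustration in base R: the matrix r reuses the illustrative numbers from the desired table in the question (they are not computed from the sample data), and the 0.6 cutoff and the name top_pairs are arbitrary choices.

```r
# Illustrative correlation matrix, copied from the desired table in the question.
terms <- c("quick", "easy", "the", "and")
r <- matrix(
  c(1.00, 0.54, 0.72, 0.92,
    0.54, 1.00, 0.98, 0.54,
    0.72, 0.98, 1.00, 0.05,
    0.92, 0.54, 0.05, 1.00),
  nrow = 4, byrow = TRUE, dimnames = list(terms, terms)
)

# Keep each unordered pair once (upper triangle) and apply the cutoff.
keep <- which(upper.tri(r) & r >= 0.6, arr.ind = TRUE)

top_pairs <- data.frame(
  term1 = rownames(r)[keep[, "row"]],
  term2 = colnames(r)[keep[, "col"]],
  cor   = r[keep],
  stringsAsFactors = FALSE
)

# Strongest associations first.
top_pairs <- top_pairs[order(-top_pairs$cor), ]
print(top_pairs)
```

In practice r would be replaced by cor(as.matrix(dtm)); using upper.tri avoids listing each pair twice and drops the diagonal of 1s.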