Using findAssocs to build a correlation matrix of all word combinations in R
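I am trying to write code that builds a table showing the correlations between all words in a corpus.
I know I can use findAssocs from the tm package to find all word correlations for a single word, e.g. findAssocs(dtm, "quick", 0.5) gives me every word whose correlation with "quick" is above 0.5, but I don't want to do this manually for every word in my text.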
# Load the .csv file into R
file_loc <- "C:/temp/TESTER.csv"
x <- read.csv(file_loc, header = FALSE)
library(tm)
corp <- Corpus(DataframeSource(x))
# Clean up the text
corp <- tm_map(corp, content_transformer(tolower))
corp <- tm_map(corp, removeNumbers)
corp <- tm_map(corp, removePunctuation)
corp <- tm_map(corp, content_transformer(stripWhitespace))
# Build the document-term matrix from the cleaned corpus
dtm <- DocumentTermMatrix(corp)
From here I can find the word correlations for an individual word:
findAssocs(dtm, "quick", 0.4)
But I would like to find all the correlations at once, like this:
      quick easy  the  and
quick  1.00 0.54 0.72 0.92
easy   0.54 1.00 0.98 0.54
the    0.72 0.98 1.00 0.05
and    0.92 0.54 0.05 1.00
Any suggestions?
Sample of the data file "TESTER.csv" (starting from cell A1):
[1] I got my question answered very quickly
[2] It was quick and easy to find the information I needed
[3] My question was answered quickly by the people at stack overflow
[4] Because they're good at what they do
[5] They got it dealt with quickly and didn't mess around
[6] The information I needed was there all along
[7] They resolved it quite quickly
You could probably use as.matrix and cor. findAssocs has a lower limit of 0, so it does not return negative correlations:
(cor_1 <- findAssocs(dtm, colnames(dtm)[1:2], 0))
#              all along
# there       1.00  1.00
# information 0.65  0.65
# needed      0.65  0.65
# the         0.47  0.47
# was         0.47  0.47
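Since findAssocs accepts a vector of terms, one way to avoid calling it word by word is to pass every term at once. The following is a minimal sketch under a few assumptions: the sample sentences are typed in directly as `docs` instead of being read from TESTER.csv, `VCorpus` is used in place of `Corpus`, and the 0.4 threshold is only illustrative. I believe recent tm versions return a named list (one numeric vector per term) rather than the matrix printed above, so the shape of the result may differ depending on the installed version.

```r
library(tm)

# Sample sentences from the question, typed in directly (assumption:
# stands in for the contents of TESTER.csv)
docs <- c(
  "I got my question answered very quickly",
  "It was quick and easy to find the information I needed",
  "My question was answered quickly by the people at stack overflow",
  "Because they're good at what they do",
  "They got it dealt with quickly and didn't mess around",
  "The information I needed was there all along",
  "They resolved it quite quickly"
)

corp <- VCorpus(VectorSource(docs))
corp <- tm_map(corp, content_transformer(tolower))
corp <- tm_map(corp, removePunctuation)
dtm  <- DocumentTermMatrix(corp)

# Ask for the associations of every term in one call
assoc_all <- findAssocs(dtm, Terms(dtm), corlimit = 0.4)

# Associations for a single term can then be looked up by name
assoc_all[["quick"]]
```

Because only correlations at or above the limit are kept, this gives a sparse view of the associations, not the full square table asked for in the question.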
cor gets you all the Pearson correlations, for what it's worth:
cor_2 <- cor(as.matrix(dtm))
cor_2[c("there", "information", "needed", "the", "was"), c("all", "along")]
#                   all     along
# there       1.0000000 1.0000000
# information 0.6454972 0.6454972
# needed      0.6454972 0.6454972
# the         0.4714045 0.4714045
# was         0.4714045 0.4714045
The code used to build dtm above:
x <- readLines(n = 7)
[1] I got my question answered very quickly
[2] It was quick and easy to find the information I needed
[3] My question was answered quickly by the people at stack overflow
[4] Because they're good at what they do
[5] They got it dealt with quickly and didn't mess around
[6] The information I needed was there all along
[7] They resolved it quite quickly
library(tm)
corp <- Corpus(VectorSource(x))
# Clean up the text
corp <- tm_map(corp, content_transformer(tolower))
corp <- tm_map(corp, removeNumbers)
corp <- tm_map(corp, removePunctuation)
corp <- tm_map(corp, content_transformer(stripWhitespace))
# Build the document-term matrix from the cleaned corpus
dtm <- DocumentTermMatrix(corp)
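To make the cor approach concrete without depending on tm, here is a rough base R sketch of the whole pipeline: it counts words per sentence, correlates every pair of term columns, and then flattens the square matrix into a table of pairs above a threshold. The names `docs`, `vocab`, `m`, `cor_all` and `pairs`, the simple tokenizer, and the 0.5 cut-off are all assumptions for illustration. Unlike the default DocumentTermMatrix settings, this tokenizer keeps short words such as "i" and "it", so the vocabulary is slightly larger than the one in dtm.

```r
# Sample sentences from the question (assumption: typed in directly)
docs <- c(
  "I got my question answered very quickly",
  "It was quick and easy to find the information I needed",
  "My question was answered quickly by the people at stack overflow",
  "Because they're good at what they do",
  "They got it dealt with quickly and didn't mess around",
  "The information I needed was there all along",
  "They resolved it quite quickly"
)

# Lowercase, strip punctuation, split on whitespace
tokens <- strsplit(gsub("[[:punct:]]", "", tolower(docs)), "\\s+")

# Document-term count matrix: one row per document, one column per term
vocab <- sort(unique(unlist(tokens)))
m <- t(vapply(
  tokens,
  function(tok) tabulate(match(tok, vocab), nbins = length(vocab)),
  integer(length(vocab))
))
colnames(m) <- vocab

# Pearson correlation between every pair of term columns
cor_all <- cor(m)

# Long format: each unordered pair once, keeping only r >= 0.5
idx <- which(upper.tri(cor_all) & cor_all >= 0.5, arr.ind = TRUE)
pairs <- data.frame(
  term1 = rownames(cor_all)[idx[, "row"]],
  term2 = colnames(cor_all)[idx[, "col"]],
  r     = cor_all[idx]
)
pairs <- pairs[order(-pairs$r), ]
```

Here `cor_all` is the full square table in the layout shown in the question, and `cor_all["quick", ]` reads off one row of it. The long `pairs` table is usually easier to scan once the vocabulary grows, since the square matrix has one row and one column per distinct word.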