R - Comparing two corpora to create a new corpus of words that are more frequent in corpus #1

Question

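I have two corpora that contain similar words, similar enough that using setdiff doesn't really solve my problem. So I've turned to finding a way to extract a list or corpus (to eventually make a word cloud) of the words that are more frequent in corpus #1 than in corpus #2. I assume something like this would need a threshold, maybe 50% more frequent?

This is everything I have right now: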

> install.packages("tm")
> install.packages("SnowballC")
> install.packages("wordcloud")
> install.packages("RColorBrewer")
> library(tm)
> library(SnowballC)
> library(wordcloud)
> library(RColorBrewer)

> UKDraft = read.csv("UKDraftScouting.csv", stringsAsFactors=FALSE)
> corpus = Corpus(VectorSource(UKDraft$Report))
> corpus = tm_map(corpus, tolower)
> corpus = tm_map(corpus, PlainTextDocument)
> corpus = tm_map(corpus, removePunctuation)
> corpus = tm_map(corpus, removeWords, c("strengths", "weaknesses", "notes",  "kentucky", "wildcats", stopwords("english")))
> frequencies = DocumentTermMatrix(corpus)
> allReports = as.data.frame(as.matrix(frequencies))

> SECDraft = read.csv("SECMinusUKDraftScouting.csv", stringsAsFactors=FALSE)
> SECcorpus = Corpus(VectorSource(SECDraft$Report))
> SECcorpus = tm_map(SECcorpus, tolower)
> SECcorpus = tm_map(SECcorpus, PlainTextDocument)
> SECcorpus = tm_map(SECcorpus, removePunctuation)
> SECcorpus = tm_map(SECcorpus, removeWords, c("strengths", "weaknesses", "notes", stopwords("english")))
> SECfrequencies = DocumentTermMatrix(SECcorpus)
> SECallReports = as.data.frame(as.matrix(SECfrequencies))

So if the word "wingspan" has a count of 100 in corpus #2 ("SECcorpus") but a count of 150 in corpus #1 ("corpus"), we would want that word in the resulting corpus/list.

Answer 1

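I can suggest an approach that might be more straightforward, based on the new text analysis package I have developed with Paul Nulty. It's called quanteda, and it is available on CRAN and GitHub.

I don't have access to your texts, but this will work in a similar fashion for your example. You create a corpus from your two sets of documents, add a document variable (using docvars), and then create a document-feature matrix grouped on the new document partition variable. The rest of the operations are straightforward; see the code below. Note that by default, dfm objects are sparse matrices, but subsetting on features is not yet implemented (next release!).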

install.packages("quanteda")
library(quanteda)

# built-in character vector of 57 inaugural addresses
str(inaugTexts)

# create a corpus, with a partition variable to represent
# the two sets of texts you want to compare
inaugCorp <- corpus(inaugTexts, 
                    docvars = data.frame(docset = c(rep(1, 29), rep(2, 28))),
                    notes = "Example made for stackoverflow")
# summarize the corpus
summary(inaugCorp, 5)

# toLower, removePunct are on by default
inaugDfm <- dfm(inaugCorp, 
                groups = "docset", # by docset instead of document
                ignoredFeatures = c("strengths", "weaknesses", "notes", stopwords("english")),
                matrixType = "dense")

# now compare frequencies and trim based on ratio threshold
ratioThreshold <- 1.5
featureRatio <- inaugDfm[2, ] / inaugDfm[1, ]
# to select where set 2 feature frequency is 1.5x set 1 feature frequency
inaugDfmReduced <- inaugDfm[2, featureRatio >= ratioThreshold]

# plot the wordcloud
plot(inaugDfmReduced)

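I would recommend passing some options through to wordcloud() (which is what plot.dfm() uses), perhaps to restrict the minimum number of features to be plotted.

Happy to help with any queries you might have about using the quanteda package.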
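One detail of the ratio step is worth spelling out: a feature with a zero count in set 1 gives a ratio of Inf (or NaN when both counts are zero). The sketch below is a minimal base R illustration on a made-up 2-row count matrix standing in for the dense grouped dfm; the matrix, its feature names, and the variables counts, keep, and reduced are assumptions for illustration, not quanteda output.

```r
# Toy stand-in for a dense, grouped document-feature matrix:
# row 1 = document set 1, row 2 = document set 2 (all counts are made up).
counts <- matrix(c(10, 0, 4, 0,
                   20, 3, 5, 0),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(docset = c("1", "2"),
                                 feature = c("freedom", "internet", "union", "unused")))

ratioThreshold <- 1.5
featureRatio <- counts[2, ] / counts[1, ]   # 2, Inf, 1.25, NaN

# NaN >= 1.5 evaluates to NA, so which() is used to drop it;
# Inf >= 1.5 is TRUE, so features absent from set 1 are kept.
keep <- which(featureRatio >= ratioThreshold)
reduced <- counts[2, keep]
```

Whether features that never occur in set 1 should be kept is a judgment call; adding a small constant to both rows before dividing would be one way to soften that.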

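New:

Here is a stab directly at your problem. I don't have your files, so I cannot verify that it works. Also, if your R skills are limited, you might find this challenging to understand; ditto if you have not looked at any of the (sadly limited for now) documentation for quanteda.

Based on your comment/query, I think what you need is the following: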

# read in each corpus separately, directly into quanteda
mycorpus1 <- corpus(textfile("UKDraftScouting.csv", textField = "Report"))
mycorpus2 <- corpus(textfile("SECMinusUKDraftScouting.csv", textField = "Report"))
# assign docset variables to each corpus as appropriate 
docvars(mycorpus1, "docset") <- 1 
docvars(mycorpus2, "docset") <- 2
myCombinedCorpus <- mycorpus1 + mycorpus2

Then proceed with the dfm step as above, substituting myCombinedCorpus for inaugCorp.
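For the direction the question actually asks about (terms more frequent in corpus #1 than in corpus #2), the same comparison can be sketched in base R without quanteda. This is only an illustration under stated assumptions: the counts are made up, and ukCounts, secCounts, and alignCounts are hypothetical names; with the data frames from the question, the two count vectors would presumably come from colSums(allReports) and colSums(SECallReports).

```r
# Hypothetical per-corpus term totals (made-up numbers).
ukCounts  <- c(wingspan = 150, athletic = 40, rebounds = 12)
secCounts <- c(wingspan = 100, athletic = 60, shooter = 30)

# Align both vectors on the union of terms, treating a missing term as 0.
allTerms <- union(names(ukCounts), names(secCounts))
alignCounts <- function(x, terms) {
  out <- setNames(numeric(length(terms)), terms)
  out[names(x)] <- x
  out
}
uk  <- alignCounts(ukCounts, allTerms)
sec <- alignCounts(secCounts, allTerms)

# Keep terms at least 50% more frequent in corpus 1 than in corpus 2.
# which() drops any NaN ratios (terms with zero counts in both corpora).
ratioThreshold <- 1.5
moreFrequentInUK <- uk[which(uk / sec >= ratioThreshold)]
```

The resulting named vector could then be passed to wordcloud(names(moreFrequentInUK), moreFrequentInUK). If the two corpora differ a lot in size, dividing each vector by its sum first would compare relative rather than raw frequencies.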

Answer 2

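I am updating the answer by @ken Benoit, as it is several years old and the quanteda package has gone through some major changes in syntax.

The current version should be (April 2017):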

str(inaugTexts)

# create a corpus, with a partition variable to represent
# the two sets of texts you want to compare
inaugCorp <- corpus(inaugTexts, 
                docvars = data.frame(docset = c(rep(1, 29), rep(2, 29))),
                notes = "Example made for stackoverflow")
# summarize the corpus
summary(inaugCorp, 5)


inaugDfm <- dfm(inaugCorp,
            groups = "docset", # by docset instead of document
            remove = c("<p>", "http://", "www", stopwords("english")),
            remove_punct = TRUE,
            remove_numbers = TRUE,
            stem = TRUE)

Answer 1
3 2015-05-30 21:54:58

Answer 2
0 2017-04-06 05:15:07
