R - comparing two corpuses to create a NEW corpus with words with higher frequency from corpus #1

I have two corpuses that contain similar words; similar enough that using setdiff doesn't really help my cause. So I've turned towards finding a way to extract a list or corpus (to eventually make a wordcloud) of words that are more frequent in corpus #1 compared to corpus #2 (assuming something like this would have a threshold, so maybe 50% more frequent?).

This is everything I have right now:

> install.packages("tm")
> install.packages("SnowballC")
> install.packages("wordcloud")
> install.packages("RColorBrewer")
> library(tm)
> library(SnowballC)
> library(wordcloud)
> library(RColorBrewer)

> UKDraft = read.csv("UKDraftScouting.csv", stringsAsFactors=FALSE)
> corpus = Corpus(VectorSource(UKDraft$Report))
> corpus = tm_map(corpus, tolower)
> corpus = tm_map(corpus, PlainTextDocument)
> corpus = tm_map(corpus, removePunctuation)
> corpus = tm_map(corpus, removeWords, c("strengths", "weaknesses", "notes",  "kentucky", "wildcats", stopwords("english")))
> frequencies = DocumentTermMatrix(corpus)
> allReports = as.data.frame(as.matrix(frequencies))

> SECDraft = read.csv("SECMinusUKDraftScouting.csv", stringsAsFactors=FALSE)
> SECcorpus = Corpus(VectorSource(SECDraft$Report))
> SECcorpus = tm_map(SECcorpus, tolower)
> SECcorpus = tm_map(SECcorpus, PlainTextDocument)
> SECcorpus = tm_map(SECcorpus, removePunctuation)
> SECcorpus = tm_map(SECcorpus, removeWords, c("strengths", "weaknesses", "notes", stopwords("english")))
> SECfrequencies = DocumentTermMatrix(SECcorpus)
> SECallReports = as.data.frame(as.matrix(SECfrequencies))

So if the word "wingspan" has a frequency count of 100 in corpus #2 ('SECcorpus') but a count of 150 in corpus #1 ('corpus'), we would want that word in our resulting corpus/list.
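For reference, that comparison can be sketched in base R directly on top of the two document-term matrices built above (this is an untested sketch assuming the `allReports` and `SECallReports` data frames from the question's code; the 1.5 threshold corresponds to "50% more frequent"):

```r
# total frequency of each term, summed over all reports in each corpus
ukFreq  <- colSums(allReports)
secFreq <- colSums(SECallReports)

# compare only terms that occur in both corpora
common <- intersect(names(ukFreq), names(secFreq))

# ratio of corpus #1 frequency to corpus #2 frequency
ratio <- ukFreq[common] / secFreq[common]

# terms at least 50% more frequent in corpus #1, most over-represented first
overRepresented <- sort(ratio[ratio >= 1.5], decreasing = TRUE)
head(overRepresented)
```

Note that raw-count ratios only make sense if the two corpora are of comparable size; otherwise divide each vector by its `sum()` first to compare relative frequencies.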

I can suggest a method that might be more straightforward, based on the new text analysis package I developed with Paul Nulty. It's called quanteda, available on CRAN and GitHub.

I don't have access to your texts, but this will work in a similar fashion for your examples. You create a corpus of your two sets of documents, then add a document variable (using docvars), and then create a document-feature matrix, grouping on the new document partition variable. The rest of the operations are straightforward; see the code below. Note that by default, dfm objects are sparse Matrixes, but subsetting on features is not yet implemented (next release!).

install.packages("quanteda")
library(quanteda)

# built-in character vector of 57 inaugural addresses
str(inaugTexts)

# create a corpus, with a partition variable to represent
# the two sets of texts you want to compare
inaugCorp <- corpus(inaugTexts, 
                    docvars = data.frame(docset = c(rep(1, 29), rep(2, 28))),
                    notes = "Example made for stackoverflow")
# summarize the corpus
summary(inaugCorp, 5)

# toLower, removePunct are on by default
inaugDfm <- dfm(inaugCorp, 
                groups = "docset", # by docset instead of document
                ignoredFeatures = c("strengths", "weaknesses", "notes", stopwords("english")),
                matrixType = "dense")

# now compare frequencies and trim based on ratio threshold
ratioThreshold <- 1.5
featureRatio <- inaugDfm[2, ] / inaugDfm[1, ]
# to select where set 2 feature frequency is 1.5x set 1 feature frequency
inaugDfmReduced <- inaugDfm[2, featureRatio >= ratioThreshold]

# plot the wordcloud
plot(inaugDfmReduced)

I would recommend you pass some options through to wordcloud() (which is what plot.dfm() uses), perhaps to restrict the minimum number of features to be plotted.
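For example, something like the following; `min.freq` and `max.words` are standard wordcloud() arguments, though whether plot.dfm() forwards them unchanged depends on your quanteda version:

```r
# keep the cloud readable: drop rare features and cap the word count
plot(inaugDfmReduced, min.freq = 2, max.words = 50)
```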

Very happy to assist with any queries you might have on using the quanteda package.

New

Here's a stab directly at your problem. I don't have your files, so I cannot verify that it works. Also, if your R skills are limited, you might find this challenging to understand; ditto if you have not looked at any of the (sadly limited, for now) documentation for quanteda.

I think what you need (based on your comment/query) is the following:

# read in each corpus separately, directly into quanteda
mycorpus1 <- corpus(textfile("UKDraftScouting.csv", textField = "Report"))
mycorpus2 <- corpus(textfile("SECMinusUKDraftScouting.csv", textField = "Report"))
# assign docset variables to each corpus as appropriate 
docvars(mycorpus1, "docset") <- 1 
docvars(mycorpus2, "docset") <- 2
myCombinedCorpus <- mycorpus1 + mycorpus2

then proceed with the dfm step as above, substituting myCombinedCorpus for inaugTexts.
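Spelled out, that substitution would look roughly like this (an untested sketch, since I don't have the CSV files; it uses the same pre-2017 quanteda syntax as the answer above, with docset 1 being your UK corpus):

```r
# same dfm step as before, now grouped on the combined corpus
myDfm <- dfm(myCombinedCorpus,
             groups = "docset",
             ignoredFeatures = c("strengths", "weaknesses", "notes",
                                 stopwords("english")),
             matrixType = "dense")

# select features at least 1.5x more frequent in corpus #1 (docset 1)
ratioThreshold <- 1.5
featureRatio <- myDfm[1, ] / myDfm[2, ]
myDfmReduced <- myDfm[1, featureRatio >= ratioThreshold]
plot(myDfmReduced)
```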

I am updating the answer by @Ken Benoit, as it is several years old and the quanteda package has gone through some major changes in syntax.

The current version should be (April 2017):

str(inaugTexts)

# create a corpus, with a partition variable to represent
# the two sets of texts you want to compare
inaugCorp <- corpus(inaugTexts, 
                docvars = data.frame(docset = c(rep(1, 29), rep(2, 29))),
                notes = "Example made for stackoverflow")
# summarize the corpus
summary(inaugCorp, 5)


inaugDfm <- dfm(inaugCorp, 
            groups = "docset", # by docset instead of document
            remove = c("<p>", "http://", "www", stopwords("english")),
            remove_punct = TRUE,
            remove_numbers = TRUE,
            stem = TRUE)
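To complete the updated example, the ratio comparison and wordcloud from the original answer would look roughly like this in 2017-era syntax, where textplot_wordcloud() replaced plot() for dfm objects (an untested sketch):

```r
# ratio of docset 2 feature frequencies to docset 1 feature frequencies
ratioThreshold <- 1.5
featureRatio <- as.numeric(inaugDfm[2, ]) / as.numeric(inaugDfm[1, ])

# keep the docset 2 row, restricted to features above the threshold
inaugDfmReduced <- inaugDfm[2, featureRatio >= ratioThreshold]
textplot_wordcloud(inaugDfmReduced)
```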
