Computing the frequency of multi-word terms in R?

Question

I'm trying to compute the frequency of multi-word terms in a given text. For instance, consider the text: "Environmental Research Environmental Research Environmental Research study science energy, economics, agriculture, ecology, and biology". I want the number of times the combined words "environmental research" occur in the text. Here is the code that I've tried.

library(tm)
#Reading the data
text = readLines(file.choose())
text1 = Corpus(VectorSource(text))

#Cleaning the data
text1 = tm_map(text1, content_transformer(tolower))
text1 = tm_map(text1, removePunctuation)
text1 = tm_map(text1, removeNumbers)
text1 = tm_map(text1, stripWhitespace)
text1 = tm_map(text1, removeWords, stopwords("english"))

#Making a document matrix
dtm = TermDocumentMatrix(text1)
m11 = as.matrix(dtm)
freq11 = sort(rowSums(m11), decreasing=TRUE)
d11 = data.frame(word=names(freq11), freq=freq11)
head(d11,9)

This code, however, produces the frequency of each word separately. How do I instead obtain the number of times "environmental research" occurs together in the text? Thanks!

Answer 1

If you already have a list of multiwords and want to compute their frequency in a text, you can use str_extract_all:

text <- "Environmental Research Environmental Research Environmental Research study science energy, economics, agriculture, ecology, and biology"

library(stringr)
str_extract_all(text, "[Ee]nvironmental [Rr]esearch")
[[1]]
[1] "Environmental Research" "Environmental Research" "Environmental Research"

If you want to know how often the multiword occurs, you can do this:

length(unlist(str_extract_all(text, "[Ee]nvironmental [Rr]esearch")))
[1] 3
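If only the count is needed, the extraction step can be skipped. The following is a minimal sketch, not part of the original answer, that assumes stringr's str_count together with regex(ignore_case = TRUE) in place of the [Ee]/[Rr] character classes; the variable name n_hits is only illustrative.

```r
library(stringr)

text <- "Environmental Research Environmental Research Environmental Research study science energy, economics, agriculture, ecology, and biology"

# Count non-overlapping matches directly; ignore_case = TRUE makes the
# pattern match "Environmental Research", "environmental research", etc.
n_hits <- str_count(text, regex("environmental research", ignore_case = TRUE))
n_hits
```

Because the match is case-insensitive, this also counts variants such as "ENVIRONMENTAL RESEARCH", which the character-class pattern above would not.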

If you're interested in extracting all multiwords at once, you can proceed like this:

First define a vector with all multiwords:

multiwords <- c("[Ee]nvironmental [Rr]esearch", "study science energy")

Then use paste0 to collapse them into a single pattern of alternatives separated by "|", and use str_extract_all with that pattern:

str_extract_all(text, paste0(multiwords, collapse = "|"))
[[1]]
[1] "Environmental Research" "Environmental Research" "Environmental Research" "study science energy"

To get the frequencies of the multiwords, you can use table:

table(str_extract_all(text, paste0(multiwords, collapse = "|")))

Environmental Research   study science energy 
                     3                      1
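If the multiwords are not known in advance, one alternative is to count every pair of adjacent words and look up the pair of interest. The sketch below is not from the original answer; it uses base R only, assumes simple whitespace tokenization after lowercasing and stripping punctuation, and the names words, bigrams and bigram_freq are illustrative.

```r
text <- "Environmental Research Environmental Research Environmental Research study science energy, economics, agriculture, ecology, and biology"

# Lowercase, drop punctuation, and split on whitespace
words <- strsplit(gsub("[[:punct:]]", "", tolower(text)), "\\s+")[[1]]

# Pair each word with the word that follows it
bigrams <- paste(head(words, -1), tail(words, -1))

# Frequency of every adjacent word pair, most frequent first
bigram_freq <- sort(table(bigrams), decreasing = TRUE)

# Look up one specific pair
as.integer(bigram_freq["environmental research"])
```

Note that this counts overlapping pairs across the whole text, so "research environmental" also appears in the table; stop words are kept here, unlike in the tm pipeline from the question.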
