Creating "word" cloud of phrases, not individual words in R

I am trying to make a word cloud from a list of phrases, many of which are repeated, instead of from individual words. My data looks something like this, with one column of my data frame being a list of phrases:

df$names <- c("John", "John", "Joseph A", "Mary A", "Mary A", "Paul H C", "Paul H C")

I would like to make a word cloud where all of these names are treated as individual phrases whose frequency is displayed, not the words which make them up. The code I have been using looks like:

library(tm)
library(wordcloud)
library(RColorBrewer)

df.corpus <- Corpus(DataframeSource(data.frame(df$names)))
df.corpus <- tm_map(df.corpus, function(x) removeWords(x, stopwords("english")))
# turning that corpus into a TDM
tdm <- TermDocumentMatrix(df.corpus)
m <- as.matrix(tdm)
v <- sort(rowSums(m), decreasing = TRUE)
d <- data.frame(word = names(v), freq = v)
pal <- brewer.pal(9, "BuGn")
pal <- pal[-(1:2)]
# making a wordcloud
png("wordcloud.png", width = 1280, height = 800)
wordcloud(d$word, d$freq, scale = c(8, .3), min.freq = 2, max.words = 100, random.order = T, rot.per = .15, colors = "black", vfont = c("sans serif", "plain"))
dev.off()

This creates a word cloud, but it is of each component word, not of the phrases. So I see the relative frequency of "A", "H", "John", etc. instead of the relative frequency of "Joseph A", "Mary A", etc., which is what I want.
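Inspecting the matrix confirms this (Terms() is from the tm package; the exact tokens depend on your tm version and its defaults):

Terms(tdm)
# lists the individual words ("john", "joseph", "mary", ...), not the full names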

I'm sure this isn't that complicated to fix, but I can't figure it out! I would appreciate any help.

Your difficulty is that each element of df$names is being treated as a "document" by the functions of tm. For example, the document "Joseph A" contains the words "Joseph" and "A". It sounds like you want to keep the names as they are and just count up their occurrences - you can use table for that.

library(wordcloud)
df <- data.frame(theNames = c("John", "John", "Joseph A", "Mary A", "Mary A", "Paul H C", "Paul H C"))
tb <- table(df$theNames)
wordcloud(names(tb), as.numeric(tb), scale = c(8, .3), min.freq = 1, max.words = 100, random.order = T, rot.per = .15, colors = "black", vfont = c("sans serif", "plain"))
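For reference, the table computed above already holds the per-phrase counts (this is just what table() returns for the example vector), which is why min.freq=1 keeps the lone "Joseph A" in the cloud:

tb
#     John Joseph A   Mary A Paul H C
#        2        1        2        2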

[image: the resulting word cloud of the full names]

Install RWeka and its dependencies, then try this:

library(RWeka)
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
# ... other tokenizers
tok <- BigramTokenizer
tdmgram <- TermDocumentMatrix(df.corpus, control = list(tokenize = tok))
#... create wordcloud

The tokenizer line above chops your text into phrases of length 2. More specifically, it creates phrases of minimum length 2 and maximum length 2. Using Weka's general NGramTokenizer algorithm, you can create different tokenizers (e.g. minimum length 1, maximum length 2), and you'll probably want to experiment with different lengths. You can also call them tok1, tok2 instead of the verbose "BigramTokenizer" I've used above.
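For example, a mixed-length tokenizer plus the cloud-building steps might look like this (a minimal sketch, assuming df.corpus was built as in the question; note that with recent versions of tm you may need VCorpus() rather than Corpus() for a custom tokenizer to take effect):

library(tm)
library(RWeka)
library(wordcloud)

# unigrams and bigrams together (minlength 1, maxlength 2)
tok1 <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 2))

tdmgram <- TermDocumentMatrix(df.corpus, control = list(tokenize = tok1))

# same frequency-counting steps as in the question, applied to the n-gram matrix
m <- as.matrix(tdmgram)
v <- sort(rowSums(m), decreasing = TRUE)
wordcloud(names(v), v, min.freq = 1, colors = "black")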
