
Word cloud in R with multiple words and special characters

I want to create a word cloud with R. I want to visualize the occurrence of variable names, which may consist of more than one word and may also contain special characters and numbers. For example, one variable name is "S & P 500 dividend yield".

The variable names are in a text file, one variable name per line, with no other separators.

I tried the following code, but the variable names are split into separate words:

library(tm)
library(SnowballC)
library(wordcloud)
library(RColorBrewer)


# load the text:
text <- readLines("./Overview_used_series.txt")
docs <- Corpus(VectorSource(text))
inspect(docs)

# build a term-document matrix:
tdm <- TermDocumentMatrix(docs)
m <- as.matrix(tdm)
v <- sort(rowSums(m),decreasing=TRUE)
d <- data.frame(word = names(v),freq=v)
head(d, 10)


# generate the wordcloud:
pdf("Word cloud.pdf")
wordcloud(words = d$word, freq = d$freq, min.freq = 1,
      max.words=200, random.order=FALSE, rot.per=0.35, 
      colors=brewer.pal(8, "Dark2"))
dev.off()

How can I handle the variable names so that they appear in the word cloud exactly as they are written in the text file?

If your file has one variable name per line, as you describe, there is no need to use tm. You can easily build your own word frequency table and use it as input. tm splits the text into words at spaces, so it will not keep your multi-word variable names intact.
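To illustrate why the names fall apart, here is a minimal sketch that splits one name on whitespace with base R's strsplit. This only stands in for the effect described above; it is not tm's actual tokenizer, and the variable names `name` and `tokens` are made up for the example.

```r
# Illustration only: base-R whitespace splitting, used here as a stand-in
# for what a word tokenizer does to a multi-word variable name.
name <- "S & P 500 dividend yield"

# Split on runs of whitespace; the result is one element per "word".
tokens <- strsplit(name, "[[:space:]]+")[[1]]
print(tokens)

# Treating the whole line as a single item keeps the name intact.
print(name)
```

The single name turns into six separate pieces, which is why counting whole lines is the better fit for this file.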

Starting from the point where the text is loaded, create a data.frame in which every name gets a frequency of 1, then aggregate it to sum the frequencies per name. The text and freq columns of the aggregated data.frame can be passed directly to wordcloud. Note that I reduced the scale a bit, because long variable names might otherwise not fit on the plot and would not be printed. You will get a warning message when this happens.

I'm not inserting the resulting picture.

library(wordcloud)
library(RColorBrewer)

# text <- readLines("./Overview_used_series.txt")
text <- c("S & P 500 dividend yield", "S & P 500 dividend yield", "S & P 500 dividend yield", 
          "visualize ", "occurence ", "variable names", "visualize ", "occurence ", 
          "variable names")

# freq = 1 adds a column of 1's, one for every variable name.
my_data <- data.frame(text = text, freq = 1, stringsAsFactors = FALSE)

# aggregate the data: sum freq over identical names.
my_agr <- aggregate(freq ~ ., data = my_data, sum)

# a smaller scale helps long variable names fit on the plot.
wordcloud(words = my_agr$text, freq = my_agr$freq, min.freq = 1,
          max.words=200, random.order=FALSE, rot.per=0.35, 
          colors=brewer.pal(8, "Dark2"), scale = c(2, .5))
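
As a possible variation, the same frequency table can be built with base R's table(). The sketch below also trims whitespace with trimws() first, on the assumption that the real file may contain stray trailing spaces that would otherwise make identical names count separately. The names `clean`, `freq_tab` and `my_freq` are introduced here for illustration only, and the plot is only drawn if the wordcloud and RColorBrewer packages are installed.

```r
text <- c("S & P 500 dividend yield", "S & P 500 dividend yield",
          "S & P 500 dividend yield", "visualize ", "occurence ",
          "variable names", "visualize ", "occurence ", "variable names")

# Assumption: leading/trailing whitespace is not part of a variable name.
clean <- trimws(text)
clean <- clean[nzchar(clean)]  # drop empty lines, if any

# Count how often each whole line occurs.
freq_tab <- table(clean)
my_freq <- data.frame(text = names(freq_tab),
                      freq = as.vector(freq_tab),
                      stringsAsFactors = FALSE)

# Most frequent names first.
my_freq <- my_freq[order(-my_freq$freq), ]
print(my_freq)

# Draw the cloud only when both packages are available.
if (requireNamespace("wordcloud", quietly = TRUE) &&
    requireNamespace("RColorBrewer", quietly = TRUE)) {
  wordcloud::wordcloud(words = my_freq$text, freq = my_freq$freq,
                       min.freq = 1, max.words = 200,
                       random.order = FALSE, rot.per = 0.35,
                       colors = RColorBrewer::brewer.pal(8, "Dark2"),
                       scale = c(2, .5))
}
```

Whether to trim is a judgment call: if surrounding whitespace is meaningful in your names, skip the trimws() step.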
