Unnecessary words included in the Word cloud created using R programming

I am trying to create some word clouds in R, which I am managing well so far with the exception of one little problem. I don't know where these words/symbols are coming from, but the following are also getting displayed in my word cloud:

  • "language"
  • "en"
  • "="

and I can't seem to remove them. These words/symbols are not part of the original text, and I don't know why or how they are getting displayed in my word cloud. I really need help understanding how I can remove such unwanted words and why they are there. Below, I am attaching a screenshot of my word cloud for clarity, with blue arrows added to show where those words/symbols appear. I am also attaching my code and the text I used for creating the word cloud. Any help is much appreciated, many thanks.

[screenshot of the word cloud]

    library(tm)           # Corpus, tm_map, removeWords, stemDocument
    library(SnowballC)    # stemming backend used by stemDocument
    library(RWeka)        # NGramTokenizer, Weka_control
    library(wordcloud)    # wordcloud()
    library(RColorBrewer) # brewer.pal()
    library(magrittr)     # the %>% pipe

    the_txt <- "
  - The wealthiest country\n
  - The highest proportion of wealthy population (population aged 40-49)\n
  - The highest numbers of \"rich business men and women\" and \"rich soil and land\"\n
  - The country with the highest \"employed populstion\" and \"self employed\" numbers
  "

    mydata <- Corpus(VectorSource(the_txt))

mydata <- mydata %>%
    tm_map(removeNumbers) %>%
    tm_map(removePunctuation) %>%
    tm_map(stripWhitespace)

mydata <- tm_map(mydata, content_transformer(tolower))

mydata <- tm_map(mydata, removeNumbers)

mydata <- tm_map(mydata, removeWords, stopwords("english"))

mydata <- tm_map(mydata, stemDocument)


as.character(mydata[[1]])

minfreq_trigram<-1

token_delim <- " \\t\\r\\n.!?,;\"()"

tritoken <- NGramTokenizer(mydata, Weka_control(min=1, max=3, delimiters = token_delim))

three_word <- data.frame(table(tritoken))

sort_three <- three_word[order(three_word$Freq, decreasing=TRUE),]

set.seed(1234)

wordcloud(sort_three$tritoken, sort_three$Freq, 
              random.order=FALSE, scale = c(3,0.4),
              min.freq = minfreq_trigram,
              colors = brewer.pal(8,"Dark2"),
              max.words=200)
> as.character(mydata)
[1] "wealthiest countri highest proport wealthi popul popul age highest number rich busi men women rich soil land countri highest employ populst self employ number"
[2] "list(language = \"en\")"                                                                                                                                       
[3] "list()" 

You checked mydata[[1]], explicitly looking at one part of mydata, but the rest also has content, and you fed all of it into NGramTokenizer and ultimately into the word cloud. Passing mydata[[1]] instead of mydata should work for you and is a straightforward approach, but I think the recommended approach is to use content(),

i.e.

mycontent <- content(mydata)

to get the character vector out.
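If it helps, here is a minimal sketch of the tail end of your pipeline with that one change applied. It reuses the objects you already built (mydata, token_delim, minfreq_trigram) and assumes the RWeka, wordcloud and RColorBrewer packages are loaded as in your question; only the first argument to NGramTokenizer differs from your code.

    # tokenize only the document text, not the corpus metadata
    # (the metadata is what shows up as list(language = "en"))
    text_only <- content(mydata)

    tritoken <- NGramTokenizer(text_only,
                               Weka_control(min = 1, max = 3, delimiters = token_delim))

    three_word <- data.frame(table(tritoken))
    sort_three <- three_word[order(three_word$Freq, decreasing = TRUE), ]

    set.seed(1234)
    wordcloud(sort_three$tritoken, sort_three$Freq,
              random.order = FALSE, scale = c(3, 0.4),
              min.freq = minfreq_trigram,
              colors = brewer.pal(8, "Dark2"),
              max.words = 200)

With the tokenizer seeing only the text, "language", "en" and "=" should no longer appear in the cloud.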
