R: wordcloud package omitting words below three characters from corpus

When creating a wordcloud using the wordcloud package, the package seems to omit words shorter than three characters (such as "tv") by default. I think this is a feature rather than a bug, but I could not find an argument that adjusts the minimum character count.

The wordcloud is run against a corpus created and preprocessed with the Corpus() and tm_map() functions from the tm package. I have confirmed that the words in question are not lost during preprocessing (e.g. when removing stopwords): they are still in the final corpus on which the wordcloud() function is run.

Reproducible example

Real data obviously looks different. However, the lines below reproduce the problem.

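library(tm)            # Corpus(), VectorSource()
library(wordcloud)     # wordcloud()
library(RColorBrewer)  # brewer.pal()

customPalette <- brewer.pal(4, "Dark2")

# "tv" occurs 15 times but is only two characters long
wordVector <- c(rep("tv", 15), rep("computer", 4), rep("phone", 16), rep("tablet", 10))
newCorpus <- Corpus(VectorSource(wordVector))

wordcloud(newCorpus, max.words = 100, scale = c(8, 1), random.order = FALSE, random.color = TRUE, colors = customPalette)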

This creates the following output:

[Image: wordcloud output]

Session info:

R version 3.3.2 (2016-10-31)
Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: Amazon Linux AMI 2016.09

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8   
 [6] LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] bindrcpp_0.2             zoo_1.8-0                wordcloud_2.5            RColorBrewer_1.1-2       SnowballC_0.5.1          tm_0.7-1                
 [7] NLP_0.1-10               reshape2_1.4.2           lubridate_1.6.0          scales_0.4.1             ggplot2_2.2.1            aws.s3_0.3.3            
[13] githubinstall_0.2.1.9001 aws.signature_0.3.2      RJDBC_0.2-5              rJava_0.9-8              DBI_0.7                  RCurl_1.95-4.8          
[19] bitops_1.0-6             jsonlite_1.5             dplyr_0.7.0              sparklyr_0.5.6           drat_0.1.2               devtools_1.13.2         

loaded via a namespace (and not attached):
 [1] slam_0.1-40       lattice_0.20-34   colorspace_1.3-2  htmltools_0.3.6   yaml_2.1.14       base64enc_0.1-3   rlang_0.1.1       glue_1.1.1       
 [9] withr_1.0.2       dbplyr_1.0.0      bindr_0.1         plyr_1.8.4        stringr_1.2.0     munsell_0.4.3     gtable_0.2.0      memoise_1.1.0    
[17] labeling_0.3      httpuv_1.3.3      parallel_3.3.2    curl_2.6          Rcpp_0.12.11      xtable_1.8-2      backports_1.1.0   config_0.2       
[25] mime_0.5          digest_0.6.12     stringi_1.1.5     shiny_1.0.3       rprojroot_1.2     grid_3.3.2        tools_3.3.2       magrittr_1.5     
[33] lazyeval_0.2.0    tibble_1.3.3      pkgconfig_2.0.1   data.table_1.10.4 xml2_1.1.1        assertthat_0.2.0  httr_1.2.1        rstudioapi_0.6   
[41] R6_2.2.2          git2r_0.18.0

The problem appears whether you pass the vector wordVector or the corpus version. This seems to be intended behaviour, according to a comment from the package maintainer.
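One plausible source of the behaviour, offered as an assumption rather than something stated in this thread: when no freq argument is supplied, wordcloud() appears to build its counts with tm's TermDocumentMatrix(), whose wordLengths control option defaults to c(3, Inf). The sketch below only uses tm to compare the terms that survive with the default setting against the terms that survive with a lower minimum length.

```r
library(tm)

wordVector <- c(rep("tv", 15), rep("computer", 4), rep("phone", 16), rep("tablet", 10))
newCorpus <- Corpus(VectorSource(wordVector))

# Default control: terms shorter than three characters are dropped
defaultTerms <- Terms(TermDocumentMatrix(newCorpus))

# Lowering the minimum term length to 1 keeps two-letter terms
allTerms <- Terms(TermDocumentMatrix(newCorpus, control = list(wordLengths = c(1, Inf))))

print(defaultTerms)
print(allTerms)
```

If "tv" shows up only in the second listing, the length filter is being applied when the term-document matrix is built, not by the plotting code.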

The following alternative approach works. It uses the ability of wordcloud to take a vector of words and a vector of their frequencies separately:

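# Count each word directly instead of letting wordcloud() process the corpus
worddf <- as.data.frame(table(newCorpus$content))
# Column 1 holds the words, column 2 their frequencies
wordcloud(words = worddf[,1], freq = worddf[,2], max.words = 100, scale=c(8,1), 
          random.order = FALSE, random.color = TRUE, colors = customPalette)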
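A variant of the same idea, as a minimal sketch: if you would rather keep tm's tokenisation than count with table(), you can build the term-document matrix yourself with a lower minimum term length and pass the resulting frequencies to wordcloud(). The wordLengths control option belongs to tm; the name wordFreq is only illustrative, and the default scale is used here so that all four words are more likely to fit.

```r
library(tm)
library(wordcloud)
library(RColorBrewer)

wordVector <- c(rep("tv", 15), rep("computer", 4), rep("phone", 16), rep("tablet", 10))
newCorpus <- Corpus(VectorSource(wordVector))

# Keep terms of any length instead of tm's default minimum of three characters
tdm <- TermDocumentMatrix(newCorpus, control = list(wordLengths = c(1, Inf)))

# Total count of each term across all documents, most frequent first
wordFreq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)

wordcloud(words = names(wordFreq), freq = wordFreq, max.words = 100,
          random.order = FALSE, colors = brewer.pal(4, "Dark2"))
```

For large corpora, as.matrix() on a term-document matrix can be memory-hungry, so the table() approach above may be the simpler choice when the corpus is just a vector of single words.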


What I have observed is that if your input text has any word with a frequency greater than or equal to the default value of min.freq, i.e. 3, all other words with a frequency below 3 are ignored.

However, if your input text has no word with a frequency >= 3, all words are considered for plotting. So in such cases, always set the min.freq argument explicitly to your desired value.
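As a small illustration of the min.freq advice, here is a sketch with made-up frequencies (the words "radio" and "fax" and all counts are hypothetical). It first computes by hand which words would pass the default threshold of 3, then passes min.freq = 1 so that the low-frequency words are considered as well.

```r
library(wordcloud)

# Hypothetical frequencies: two words fall below the default min.freq of 3
freq <- c(phone = 16, tv = 15, tablet = 10, radio = 2, fax = 1)

# Words that would pass the default threshold, computed by hand for comparison
keptByDefault <- names(freq)[freq >= 3]
print(keptByDefault)

# Setting min.freq explicitly asks wordcloud() to consider every word
wordcloud(words = names(freq), freq = freq, min.freq = 1,
          random.order = FALSE)
```

Note that min.freq filters on frequency only. It does not bring back words that were already dropped for being too short before the frequencies were computed.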
