
How to transform a Document Term Matrix in R?

Hello, I have a document term matrix that I transformed with the tidy() function, and that works perfectly. I want to plot a word cloud based on word frequency. My transformed table looks like this:

> head(Wcloud.Data)
# A tibble: 6 x 3
  document term       count
  <chr>    <chr>      <dbl>
1 1        accept         1
2 1        access         1
3 1        accomplish     1
4 1        account        4
5 1        accur          2
6 1        achiev         1

I have 33,647,383 observations, so it is a very big data frame. If I use the max() function I get a very high number (64116), but no word in my data frame has a frequency of 64116. Also, if I plot the data frame in shiny with wordcloud(), it plots the same words several times. Sorting the count column does not work either: sort(Wcloud.Data$count, decreasing = TRUE). So something is not correct, but I don't know what it is or how to solve it. Does anybody have an idea?

This is the summary of my document term matrix before transforming it into a data frame:

> observations.tf
<<DocumentTermMatrix (documents: 76717, terms: 4234)>>
Non-/sparse entries: 33647383/291172395
Sparsity           : 90%
Maximal term length: 15
Weighting          : term frequency (tf)

Update: I have added a picture of my data frame.

[image: data frame]

[image: output]

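The tidy table has one row per document/term pair, so the same term appears once for every document that contains it, which is why wordcloud() plots words several times. Using dplyr, you can sum the counts per term first: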

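library("tm")
library("SnowballC")
library("dplyr")        # provides %>%, group_by() and summarise()
library("wordcloud")
library("RColorBrewer")

# Small sample with the same columns as the tidy table
Wcloud.Data <- data.frame(Document = c(rep(1, 6)),
                          term = c("accept", "access", "accomplish", "account", "accur", "achiev"),
                          count = c(1, 1, 1, 4, 2, 1))

# One row per term, with the counts summed over all documents
Data <- Wcloud.Data %>%
  group_by(term) %>%
  summarise(Frequency = sum(count))

set.seed(1234)  # makes the word cloud layout reproducible
wordcloud(words = Data$term, freq = Data$Frequency, min.freq = 1,
          max.words = 200, random.order = FALSE, rot.per = 0.35,
          colors = brewer.pal(8, "Dark2"))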
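The question also mentions that sort() "is not working". A likely reason is that sort() only returns a sorted copy of a vector and leaves the data frame untouched. Below is a minimal base R sketch of aggregating and then reordering the rows; the `tidy_counts` table and its values are made up for illustration and only mimic the shape of the tidy output.

```r
# Toy stand-in for the tidy table: one row per (document, term) pair,
# so the same term shows up once for every document that contains it.
tidy_counts <- data.frame(
  document = c("1", "1", "2", "2", "3"),
  term     = c("account", "access", "account", "accept", "account"),
  count    = c(4, 1, 2, 1, 3),
  stringsAsFactors = FALSE
)

# Collapse to one row per term by summing counts across documents.
term_totals <- aggregate(count ~ term, data = tidy_counts, FUN = sum)

# sort() only returns a sorted copy of a vector; to reorder the rows of the
# data frame, index it with order() and assign the result back.
term_totals <- term_totals[order(term_totals$count, decreasing = TRUE), ]

# Keep only the most frequent terms before plotting.
top_terms <- head(term_totals, 2)
print(top_terms)
```

With dplyr loaded, `arrange(Data, desc(Frequency))` should give the same ordering on the summarised table.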


Alternatively, the quanteda and tibble packages can help you create the term frequency matrix. Here is an example to work with:

library(tibble)
library(quanteda)
# tibble() replaces the deprecated data_frame()
Data <- tibble(text = c("Chinese Beijing Chinese",
                        "Chinese Chinese Shanghai",
                        "this is china",
                        "china is here",
                        'hello china',
                        "Chinese Beijing Chinese",
                        "Chinese Chinese Shanghai",
                        "this is china",
                        "china is here",
                        'hello china',
                        "Kyoto Japan",
                        "Tokyo Japan Chinese",
                        "Kyoto Japan",
                        "Tokyo Japan Chinese",
                        "Kyoto Japan",
                        "Tokyo Japan Chinese",
                        "Kyoto Japan",
                        "Tokyo Japan Chinese",
                        'japan'))
# Recent quanteda versions require tokenizing before building the dfm
DocTerm <- quanteda::dfm(quanteda::tokens(Data$text))
DocTerm
# Document-feature matrix of: 19 documents, 11 features (78.5% sparse).
# 19 x 11 sparse Matrix of class "dfm"
# features
# docs     chinese beijing shanghai this is china here hello kyoto japan tokyo
# text1        2       1        0    0  0     0    0     0     0     0     0
# text2        2       0        1    0  0     0    0     0     0     0     0
# text3        0       0        0    1  1     1    0     0     0     0     0
# text4        0       0        0    0  1     1    1     0     0     0     0
# text5        0       0        0    0  0     1    0     1     0     0     0
# text6        2       1        0    0  0     0    0     0     0     0     0
# text7        2       0        1    0  0     0    0     0     0     0     0
# text8        0       0        0    1  1     1    0     0     0     0     0
# text9        0       0        0    0  1     1    1     0     0     0     0
# text10       0       0        0    0  0     1    0     1     0     0     0
# text11       0       0        0    0  0     0    0     0     1     1     0
# text12       1       0        0    0  0     0    0     0     0     1     1
# text13       0       0        0    0  0     0    0     0     1     1     0
# text14       1       0        0    0  0     0    0     0     0     1     1
# text15       0       0        0    0  0     0    0     0     1     1     0
# text16       1       0        0    0  0     0    0     0     0     1     1
# text17       0       0        0    0  0     0    0     0     1     1     0
# text18       1       0        0    0  0     0    0     0     0     1     1
# text19       0       0        0    0  0     0    0     0     0     1     0

Mat <- quanteda::convert(DocTerm, to = "data.frame")[, -1] # Convert to a data frame, dropping only the first (document id) column
Result <- colSums(Mat) # This is what you are interested in
# > Result
# chinese  beijing shanghai     this       is    china     here    hello    kyoto    japan    tokyo
#      12        2        2        2        4        6        2        2        4        9        4
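For the tm DocumentTermMatrix from the question, the same per-term totals can probably be computed without converting to a data frame at all. tm stores the matrix as a slam simple_triplet_matrix, and slam provides col_sums() for that class. The sketch below uses a made-up 3 x 3 matrix (`toy_dtm` is hypothetical) in place of observations.tf; it assumes the slam package is available, which is normally the case when tm is installed.

```r
library(slam)  # normally installed as a dependency of tm

# Hypothetical 3-document, 3-term matrix in the triplet form tm uses.
toy_dtm <- simple_triplet_matrix(
  i = c(1L, 1L, 2L, 3L, 3L),   # document (row) index of each non-zero entry
  j = c(1L, 2L, 1L, 1L, 3L),   # term (column) index of each non-zero entry
  v = c(4, 1, 2, 3, 1),        # the counts themselves
  nrow = 3L, ncol = 3L,
  dimnames = list(Docs = c("1", "2", "3"),
                  Terms = c("account", "access", "accept"))
)

# Per-term totals, computed on the sparse representation.
term_freq <- sort(col_sums(toy_dtm), decreasing = TRUE)
print(term_freq)
```

On the real data this would be `col_sums(observations.tf)`, and the resulting named vector could then be passed to wordcloud() as `words = names(term_freq), freq = term_freq`.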
