Word frequency per document in R

I have the following sample data frame:

   comments                date
1 i want to hear that   2010-11-01
2 lets get started      2008-03-25
3 i want to get started 2007-03-14

I want to get the word frequency across all the documents, and I also want to store the document number (1, 2 or 3) in which each word appeared.
The output should be a matrix with the words in one column, their frequency in a second column and the document number in a third.

I tried the usual tm package approach, but it isn't working in my case.

Using the tm package and tidyr:

library(tm)
library(tidyr)

df <- data.frame(id = c(1, 2, 3),
                 comments = c("that is that", "lets get started", "i want to get started"),
                 date = as.Date(c("2010-11-01", "2008-03-25", "2007-03-14")), stringsAsFactors = FALSE)

corpus <- Corpus(VectorSource(df$comments))
dtm <- DocumentTermMatrix(corpus, control=list(wordLengths=c(1, Inf)))

my_data <- data.frame(as.matrix(dtm), id = df$id, date = df$date)

outcome <- gather(my_data, words, freq, -id, -date)
head(outcome)

  id       date words freq
1  1 2010-11-01   get    0
2  2 2008-03-25   get    1
3  3 2007-03-14   get    1
4  1 2010-11-01     i    0
5  2 2008-03-25     i    0
6  3 2007-03-14     i    1
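
The long table above keeps a row for every word/document pair, including the pairs with a frequency of 0. If only the words that actually occur in each document are wanted, one option is to skip the document-term matrix entirely. The following is a minimal base R sketch, not the answer's method: it assumes that splitting on whitespace is an acceptable tokenization, and the names `tokens`, `long` and `counts` are made up for illustration.

```r
comments <- c("that is that", "lets get started", "i want to get started")

# Split each comment on whitespace: one character vector of tokens per document
tokens <- strsplit(comments, "\\s+")

# Pair every token with the number of the document it came from
long <- data.frame(
  doc  = rep(seq_along(tokens), lengths(tokens)),
  word = unlist(tokens),
  stringsAsFactors = FALSE
)

# Count how often each word occurs within each document
counts <- aggregate(list(freq = rep(1L, nrow(long))),
                    by = list(word = long$word, doc = long$doc),
                    FUN = sum)
counts
```

Because only observed word/document pairs are aggregated, `counts` contains no zero-frequency rows, and `doc` is simply the position of the comment in the input vector.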

I've been working with data.table plus stringi more recently, so I thought I'd post these solutions. They are similar to the dplyr solution but may give a nice speed boost with larger data sets.

dat <- data.frame(
    comments= c("i want to hear that", "lets get started", "i want to get started"),
    date = as.Date(c("2010-11-01", "2008-03-25", "2007-03-14")), stringsAsFactors = FALSE
)


library(data.table); library(stringi)
setDT(dat)

dat[, list(word = unlist(stri_extract_all_words(comments)))][, 
    list(freq=.N), by = 'word'][order(word),]

##       word freq
## 1:     get    2
## 2:    hear    1
## 3:       i    2
## 4:    lets    1
## 5: started    2
## 6:    that    1
## 7:      to    2
## 8:    want    2


dat[, list(word = unlist(stri_extract_all_words(comments))), by="date"][, 
    list(freq=.N), by = c('date', 'word')][order(date, word),]

##           date    word freq
##  1: 2007-03-14     get    1
##  2: 2007-03-14       i    1
##  3: 2007-03-14 started    1
##  4: 2007-03-14      to    1
##  5: 2007-03-14    want    1
##  6: 2008-03-25     get    1
##  7: 2008-03-25    lets    1
##  8: 2008-03-25 started    1
##  9: 2010-11-01    hear    1
## 10: 2010-11-01       i    1
## 11: 2010-11-01    that    1
## 12: 2010-11-01      to    1
## 13: 2010-11-01    want    1
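
The second call above groups by `date`, while the question asks for the document number. Assuming the row position can stand in for the document number, the same pattern might be adapted as in the sketch below; the `doc` column and the `word_doc` name are hypothetical additions, not part of the original answer.

```r
library(data.table)
library(stringi)

dat <- data.table(
  comments = c("i want to hear that", "lets get started", "i want to get started"),
  date = as.Date(c("2010-11-01", "2008-03-25", "2007-03-14"))
)

# Hypothetical `doc` column: the row index is used as the document number
dat[, doc := .I]

# One row per (doc, word) occurrence, then count the rows within each pair
word_doc <- dat[, list(word = unlist(stri_extract_all_words(comments))), by = doc][
  , list(freq = .N), by = c("doc", "word")][order(doc, word)]
word_doc
```

If two documents could share a date, grouping by `doc` rather than `date` also keeps their counts separate.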

library(dplyr)
library(tidyr)
library(stringi)

# Count each word within each date.
# tibble() replaces the deprecated data_frame().
word__date = 
  tibble(
    comments = c("i want to hear that", "lets get started", "i want to get started"),
    date = c("2010-11-01", "2008-03-25", "2007-03-14") %>% as.Date ) %>%
  mutate(word = comments %>% stri_split_fixed(pattern = " ")) %>%
  unnest(word) %>%
  group_by(word, date) %>%
  summarize(count = n())

# The result above is still grouped by word, so this sums the counts per word
word = 
  word__date %>%
  summarize(count = sum(count))
