Word frequency over time by user in R

Question

My goal is to build a chart of word frequency over time. I have roughly 36,000 individual user comments, each with an associated date. A 25-user sample is here: http://pastebin.com/kKfby5kf

I am trying to get the most frequently used words (perhaps the top 10) on a given date. I feel my approach is close, but not quite right:

library("tm")

# one data frame of term frequencies per date, keyed by date
frequencylist <- list()

for (i in unique(sampledf[, 2])) {

  # comments posted on date i
  daily <- subset(sampledf, sampledf[, 2] == i)

  comments <- as.vector(daily[, 1])
  verbatims <- Corpus(VectorSource(comments))
  verbatims <- tm_map(verbatims, stripWhitespace)
  verbatims <- tm_map(verbatims, content_transformer(tolower))
  verbatims <- tm_map(verbatims, removeWords, stopwords("english"))
  verbatims <- tm_map(verbatims, removePunctuation)

  # "game" is very common but not very interesting, so drop it too
  stopwords2 <- c("game")
  verbatims2 <- tm_map(verbatims, removeWords, stopwords2)

  # term counts for this date, most frequent first
  dtm <- DocumentTermMatrix(verbatims2)
  dtm2 <- as.matrix(dtm)
  frequency <- colSums(dtm2)
  frequency <- sort(frequency, decreasing = TRUE)
  frequencydf <- data.frame(frequency)
  frequencydf$comments <- row.names(frequencydf)
  frequencydf$date <- i

  frequencylist[[i]] <- frequencydf
}

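An explanation of my madness: the pastebin sample goes into sampledf. For each unique date in the sample, I am trying to get a word frequency table, and then store each table in a list (which may not be the best approach). First I subset by date, then strip whitespace, common English words, and punctuation, and lowercase everything. Then I do another word-removal pass for "game", since it is not very interesting but very common. To get the word frequencies, I pass the result into a document-term matrix and run a simple colSums(). Finally, I append the date to that table and try to store it in the list.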

I am not sure my strategy is valid to begin with. Is there a simpler, better way to approach this problem?

Answer 1

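The commenters are right that there are better ways to set up a reproducible example. Your question could also be more specific about the output you are trying to produce. (I could not get your code to run without errors.)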

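However, you asked for a simpler, better way, and I think the following is both. It uses the quanteda text package and takes advantage of the groups feature when creating the document-feature matrix. It then performs some ranking on the "dfm" to get the daily term rankings you are looking for.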

Note that this assumes the linked data was loaded with read.delim("sampledf.tsv", stringsAsFactors = FALSE).
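
As a rough illustration of that loading step, here is a minimal base R sketch. The two inline rows are made up and merely stand in for sampledf.tsv; the column names content_strip and postedDate_fix are taken from the code in this answer, and the month/day/year date layout is an assumption. Note that in R's date formats "%m" is the month, whereas "%M" means minutes.

```r
# Hypothetical two-row stand-in for sampledf.tsv (the real file is not reproduced here)
tsv <- "content_strip\tpostedDate_fix\ngreat game\t10/4/2013\nnice play\t11/18/2013"

# read.delim() reads tab-separated text; text = lets us parse a string instead of a file
sampledf <- read.delim(text = tsv, stringsAsFactors = FALSE)

# Parse month/day/year strings; "%m" is the month ("%M" would mean minutes)
dates <- as.Date(sampledf$postedDate_fix, format = "%m/%d/%Y")
print(dates)
```

With real data, it is worth checking range(dates) after parsing to confirm the months came through as expected.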

require(quanteda)
# create a corpus with a date document variable
# (dates are month/day/year; "%m" is the month, "%M" would be read as minutes)
myCorpus <- corpus(sampledf$content_strip, 
                   docvars = data.frame(date = as.Date(sampledf$postedDate_fix, "%m/%d/%Y")))

# construct a dfm, group on date, and remove stopwords plus the term "game"
myDfm <- dfm(myCorpus, groups = "date", ignoredFeatures = c("game", stopwords("english")))
## Creating a dfm from a corpus ...
## ... grouping texts by variable: date
## ... lowercasing
## ... tokenizing
## ... indexing documents: 20 documents
## ... indexing features: 198 feature types
## ... removed 47 features, from 175 supplied (glob) feature types
## ... created a 20 x 151 sparse dfm
## ... complete. 
## Elapsed time: 0.009 seconds.

myDfm <- sort(myDfm) # not required, just for presentation
# remove a really nasty long term
myDfm <- removeFeatures(myDfm, "^a{10}", valuetype = "regex")
## removed 1 feature, from 1 supplied (regex) feature types

# make a data.frame of the daily ranks of each feature
featureRanksByDate <- as.data.frame(t(apply(myDfm, 1, order, decreasing = TRUE)))
names(featureRanksByDate) <- features(myDfm)
featureRanksByDate[, 1:10]
##              â great nice play  go will can get ever first
## 2013-10-02   1    18   19   20  21   22  23  24   25    26
## 2013-10-04   3     1    2    4   5    6   7   8    9    10
## 2013-10-05   3     9   28   29   1    2   4   5    6     7
## 2013-10-06   7     4    8   10  11   30  31  32   33    34
## 2013-10-07   5     1    2    3   4    6   7   8    9    10
## 2013-10-09  12    42   43    1   2    3   4   5    6     7
## 2013-10-13   1    14    6    9  10   13  44  45   46    47
## 2013-10-16   2     3   84   85   1    4   5   6    7     8
## 2013-10-18  15     1    2    3   4    5   6   7    8     9
## 2013-10-19   3    86    1    2   4    5   6   7    8     9
## 2013-10-22   2    87   88   89  90   91  92  93   94    95
## 2013-10-23  13    98   99  100 101  102 103 104  105   106
## 2013-10-25   4     6    5   12  16  109 110 111  112   113
## 2013-10-27   8     4    6   15  17  124 125 126  127   128
## 2013-10-30  11     1    2    3   4    5   6   7    8     9
## 2014-10-01   7    16  139    1   2    3   4   5    6     8
## 2014-10-02 140     1    2    3   4    5   6   7    8     9
## 2014-10-03 141   142  143    1   2    3   4   5    6     7
## 2014-10-05 144   145  146  147 148    1   2   3    4     5
## 2014-10-06  17   149  150    1   2    3   4   5    6     7

# top n features by day
n <- 10 
as.data.frame(apply(featureRanksByDate, 1, function(x) {
    todaysTopFeatures <- names(featureRanksByDate)
    names(todaysTopFeatures) <- x
    todaysTopFeatures[as.character(1:n)]
}), row.names = 1:n)
##    2013-10-02 2013-10-04 2013-10-05 2013-10-06 2013-10-07 2013-10-09 2013-10-13 2013-10-16 2013-10-18 2013-10-19 2013-10-22 2013-10-23
## 1           â      great         go     triple      great       play          â         go      great       nice       year       year
## 2         win       nice       will      niple       nice         go    created          â       nice       play          â       give
## 3        year          â          â   backflip       play       will      wasnt      great       play          â       give       good
## 4        give       play        can      great         go        can      money       will         go         go       good       hard
## 5        good         go        get      scope          â        get     prizes        can       will       will       hard       time
## 6        hard       will       ever       ball       will       ever       nice        get        can        can       time     triple
## 7        time        can      first          â        can      first      piece       ever        get        get     triple      niple
## 8      triple        get        fun       nice        get        fun       dead      first       ever       ever      niple   backflip
## 9       niple       ever      great   testical       ever        win       play        fun      first      first   backflip      scope
## 10   backflip      first        win       play      first       year         go        win        fun        fun      scope       ball
##    2013-10-25 2013-10-27 2013-10-30 2014-10-01 2014-10-02 2014-10-03 2014-10-05 2014-10-06
## 1       scope      scope      great       play      great       play       will       play
## 2        ball       ball       nice         go       nice         go        can         go
## 3    testical   testical       play       will       play       will        get       will
## 4           â      great         go        can         go        can       ever        can
## 5        nice       shot       will        get       will        get      first        get
## 6       great       nice        can       ever        can       ever        fun       ever
## 7        shot       head        get          â        get      first        win      first
## 8        head          â       ever      first       ever        fun       year        fun
## 9     dancing    dancing      first        fun      first        win       give        win
## 10        cow        cow        fun        win        fun       year       good       year
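
When building tables like the ones above, it helps to keep in mind that order() and rank() answer different questions: order() returns the positions of the elements in sorted order, while rank() returns each element's rank. The following sketch shows both on a tiny, made-up count matrix (the dates, terms, and counts are illustrative only, not taken from the sample data), and reads the top n terms per date off the order() result.

```r
# Made-up term counts: rows are dates, columns are terms
counts <- matrix(c(5, 1, 3,
                   0, 4, 2),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("2013-10-04", "2013-10-05"),
                                 c("great", "nice", "play")))

# rank(): the rank of each term within its row (1 = most frequent)
ranks <- t(apply(counts, 1, function(x) rank(-x, ties.method = "first")))

# order(): column indices listed from most to least frequent
ordering <- t(apply(counts, 1, order, decreasing = TRUE))

# Top n terms per date, looked up through the ordering
n <- 2
top_n_terms <- t(apply(ordering, 1, function(idx) colnames(counts)[idx[seq_len(n)]]))

print(ranks)
print(ordering)
print(top_n_terms)
```

For the second row the two results differ: the ranks are 3, 1, 2 while the ordering is 2, 3, 1, so it matters which of the two a table holds when looking up the top terms.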

By the way, interesting spellings of "niple" and "testical".
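
For comparison, the same idea (pool comments by date, clean them, count terms, keep the top n) can be sketched in base R without any text package. This is only an illustration under stated assumptions: the three comments are made up, the stopword list is a tiny hypothetical one rather than a full English list, and tokenize() and top_terms() are helper names introduced here.

```r
# Hypothetical miniature data set; column names mirror those used above
sampledf <- data.frame(
  content_strip = c("Great game, great fun!", "Nice play and great shot", "Will play again"),
  postedDate_fix = c("10/04/2013", "10/04/2013", "10/05/2013"),
  stringsAsFactors = FALSE
)
dates <- as.Date(sampledf$postedDate_fix, format = "%m/%d/%Y")

# Small illustrative stopword list (an assumption, not the full English list)
stopwords_small <- c("game", "and", "the", "a")

# Lowercase, strip punctuation, split on whitespace, drop stopwords
tokenize <- function(text) {
  words <- strsplit(gsub("[[:punct:]]", "", tolower(text)), "\\s+")[[1]]
  words[nzchar(words) & !(words %in% stopwords_small)]
}

# Pool the tokens of all comments posted on the same date
tokens_by_date <- lapply(split(sampledf$content_strip, dates), function(texts) {
  unlist(lapply(texts, tokenize))
})

# Count terms and keep the n most frequent ones
top_terms <- function(tokens, n = 3) {
  counts <- sort(table(tokens), decreasing = TRUE)
  counts[seq_len(min(n, length(counts)))]
}
top_by_date <- lapply(tokens_by_date, top_terms)
print(top_by_date)
```

The result is a list with one small frequency table per date, which can be bound into a single data frame for plotting. For 36,000 comments a dedicated package will likely be faster and handle tokenization edge cases better.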

Solution 1
0 2015-10-23 06:00:40