Word frequency over time by user in R
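My goal is to build a chart of word frequency over time. I have about 36,000 individual user comments, each with an associated date. A sample of 25 users is here: http://pastebin.com/kKfby5kf
I am trying to get the most frequently used words (say, the top 10) on a given date. I feel my approach is close, but not quite right: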
library("tm")
frequencylist <- list(0)
for(i in unique(sampledf[,2])){
  subset <- subset(sampledf, sampledf[,2]==i)
  comments <- as.vector(subset[,1])
  verbatims <- Corpus(VectorSource(comments))
  verbatims <- tm_map(verbatims, stripWhitespace)
  verbatims <- tm_map(verbatims, content_transformer(tolower))
  verbatims <- tm_map(verbatims, removeWords, stopwords("english"))
  verbatims <- tm_map(verbatims, removePunctuation)
  stopwords2 <- c("game")
  verbatims2 <- tm_map(verbatims, removeWords, stopwords2)
  dtm <- DocumentTermMatrix(verbatims2)
  dtm2 <- as.matrix(dtm)
  frequency <- colSums(dtm2)
  frequency <- sort(frequency, decreasing=TRUE)
  frequencydf <- data.frame(frequency)
  frequencydf$comments <- row.names(frequencydf)
  frequencydf$date <- i
  frequencylist[[i]] <- frequencydf
}
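An explanation of my madness: the pastebin sample goes into sampledf. For each unique date in the sample, I try to get the word frequencies. I then try to store each frequency table in a list (though that may not be the best approach). First I subset by date, then strip whitespace, common English words, and punctuation, and lowercase everything. I then do a second word-removal pass for "game", since it is common but not very interesting. To get the word frequencies, I pass the result into a document-term matrix and run a simple colSums(). Then I attach the date to that table and try to store it in the list.
I am not sure my strategy is valid to begin with. Is there a simpler, better way to approach this problem?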
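The per-date counting described above can also be sketched in base R, without tm. This is only an illustration under stated assumptions: the three-row sampledf, the short drop_words list, and the helper top_words are made up for the example and are not taken from the pastebin data.

```r
# Hypothetical stand-in for sampledf: column 1 = comment text, column 2 = date.
sampledf <- data.frame(
  content = c("Great game, great fun", "Nice game to play", "Great play!"),
  date    = c("10/02/2013", "10/02/2013", "10/04/2013"),
  stringsAsFactors = FALSE
)

# Words to drop (assumed list); a real run would use a fuller stopword list.
drop_words <- c("game", "to", "the", "a")

# Hypothetical helper: the n most frequent words in a character vector.
top_words <- function(texts, n = 10) {
  # Lowercase, strip punctuation, and split on whitespace.
  cleaned <- gsub("[[:punct:]]", "", tolower(texts))
  words <- unlist(strsplit(cleaned, "[[:space:]]+"))
  words <- words[nzchar(words) & !(words %in% drop_words)]
  # Count the words and keep the n most frequent.
  head(sort(table(words), decreasing = TRUE), n)
}

# One frequency table per date, stored in a list named by date.
freq_by_date <- lapply(split(sampledf$content, sampledf$date), top_words, n = 3)
freq_by_date[["10/02/2013"]]
```

Splitting the text by date first and naming the list elements by date avoids indexing a list with the loop variable, which is where the loop above is most fragile.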
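The commenters are right that there are better ways to set up a reproducible example. Also, your question could be more specific about the output you are trying to produce. (I could not get your code to run correctly.)
However: you asked for a simpler, better way. Here is what I think is both. It uses the quanteda text package and takes advantage of the groups feature when creating the document-feature matrix. It then performs some ranking on the "dfm" to get the daily term rankings you need.
Note that this assumes I loaded your linked data with read.delim("sampledf.tsv", stringsAsFactors = FALSE).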
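With stringsAsFactors = FALSE the date column arrives as plain character strings, so it has to be converted with as.Date before it can be used for grouping. A minimal sketch of that conversion, assuming month/day/year strings; the two values in dates_chr are made up. In strptime-style formats %m is the month, while %M is the minute, so the two are easy to mix up.

```r
# Made-up date strings in month/day/year form (assumed layout of the date column).
dates_chr <- c("10/02/2013", "03/15/2014")

# %m = month (01-12), %d = day of month, %Y = four-digit year.
parsed <- as.Date(dates_chr, format = "%m/%d/%Y")

# Print in ISO form to check that month and day landed where expected.
format(parsed, "%Y-%m-%d")
```

Checking a few parsed values against the raw strings before grouping is a cheap way to catch a wrong format code.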
require(quanteda)
# create a corpus with a date document variable
myCorpus <- corpus(sampledf$content_strip,
                   # %m is the month; %M would be read as minutes
                   docvars = data.frame(date = as.Date(sampledf$postedDate_fix, "%m/%d/%Y")))
# construct a dfm, group on date, and remove stopwords plus the term "game"
myDfm <- dfm(myCorpus, groups = "date", ignoredFeatures = c("game", stopwords("english")))
## Creating a dfm from a corpus ...
## ... grouping texts by variable: date
## ... lowercasing
## ... tokenizing
## ... indexing documents: 20 documents
## ... indexing features: 198 feature types
## ... removed 47 features, from 175 supplied (glob) feature types
## ... created a 20 x 151 sparse dfm
## ... complete.
## Elapsed time: 0.009 seconds.
myDfm <- sort(myDfm) # not required, just for presentation
# remove a really nasty long term
myDfm <- removeFeatures(myDfm, "^a{10}", valuetype = "regex")
## removed 1 feature, from 1 supplied (regex) feature types
# make a data.frame of the daily ranks of each feature
featureRanksByDate <- as.data.frame(t(apply(myDfm, 1, order, decreasing = TRUE)))
names(featureRanksByDate) <- features(myDfm)
featureRanksByDate[, 1:10]
##                    â great  nice  play    go  will   can   get  ever first
## 2013-10-02     1    18    19    20    21    22    23    24    25    26
## 2013-10-04     3     1     2     4     5     6     7     8     9    10
## 2013-10-05     3     9    28    29     1     2     4     5     6     7
## 2013-10-06     7     4     8    10    11    30    31    32    33    34
## 2013-10-07     5     1     2     3     4     6     7     8     9    10
## 2013-10-09    12    42    43     1     2     3     4     5     6     7
## 2013-10-13     1    14     6     9    10    13    44    45    46    47
## 2013-10-16     2     3    84    85     1     4     5     6     7     8
## 2013-10-18    15     1     2     3     4     5     6     7     8     9
## 2013-10-19     3    86     1     2     4     5     6     7     8     9
## 2013-10-22     2    87    88    89    90    91    92    93    94    95
## 2013-10-23    13    98    99   100   101   102   103   104   105   106
## 2013-10-25     4     6     5    12    16   109   110   111   112   113
## 2013-10-27     8     4     6    15    17   124   125   126   127   128
## 2013-10-30    11     1     2     3     4     5     6     7     8     9
## 2014-10-01     7    16   139     1     2     3     4     5     6     8
## 2014-10-02   140     1     2     3     4     5     6     7     8     9
## 2014-10-03   141   142   143     1     2     3     4     5     6     7
## 2014-10-05   144   145   146   147   148     1     2     3     4     5
## 2014-10-06    17   149   150     1     2     3     4     5     6     7
# top n features by day
n <- 10
as.data.frame(apply(featureRanksByDate, 1, function(x) {
    todaysTopFeatures <- names(featureRanksByDate)
    names(todaysTopFeatures) <- x
    todaysTopFeatures[as.character(1:n)]
}), row.names = 1:n)
##    2013-10-02 2013-10-04 2013-10-05 2013-10-06 2013-10-07 2013-10-09 2013-10-13 2013-10-16 2013-10-18 2013-10-19 2013-10-22 2013-10-23
##  1 â          great      go         triple     great      play       â          go         great      nice       year       year
##  2 win        nice       will       niple      nice       go         created    â          nice       play       â          give
##  3 year       â          â          backflip   play       will       wasnt      great      play       â          give       good
##  4 give       play       can        great      go         can        money      will       go         go         good       hard
##  5 good       go         get        scope      â          get        prizes     can        will       will       hard       time
##  6 hard       will       ever       ball       will       ever       nice       get        can        can        time       triple
##  7 time       can        first      â          can        first      piece      ever       get        get        triple     niple
##  8 triple     get        fun        nice       get        fun        dead       first      ever       ever       niple      backflip
##  9 niple      ever       great      testical   ever       win        play       fun        first      first      backflip   scope
## 10 backflip   first      win        play       first      year       go         win        fun        fun        scope      ball
##    2013-10-25 2013-10-27 2013-10-30 2014-10-01 2014-10-02 2014-10-03 2014-10-05 2014-10-06
##  1 scope      scope      great      play       great      play       will       play
##  2 ball       ball       nice       go         nice       go         can        go
##  3 testical   testical   play       will       play       will       get        will
##  4 â          great      go         can        go         can        ever       can
##  5 nice       shot       will       get        will       get        first      get
##  6 great      nice       can        ever       can        ever       fun        ever
##  7 shot       head       get        â          get        first      win        first
##  8 head       â          ever       first      ever       fun        year       fun
##  9 dancing    dancing    first      fun        first      win        give       win
## 10 cow        cow        fun        win        fun        year       good       year
By the way, interesting spellings of "niple" and "testical".
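When checking a per-day ranking like the one above, it helps to keep order() and rank() apart: order() returns the positions that sort a vector, while rank() returns each element's rank. A small sketch on a made-up count matrix (the dates, words, and counts are invented for illustration and are not the answer's data) shows one way to read the top n names directly from order() output.

```r
# Made-up counts: rows are dates, columns are words.
counts <- matrix(
  c(1, 5, 3,
    4, 0, 2),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("2013-10-02", "2013-10-04"), c("nice", "great", "play"))
)

n <- 2

# order(): for each date, column positions from most to least frequent.
top_idx <- t(apply(counts, 1, order, decreasing = TRUE))

# Index the column names with the first n positions to get the top n words.
top_n <- apply(top_idx[, seq_len(n), drop = FALSE], 1,
               function(idx) colnames(counts)[idx])
top_n

# rank(): for each date, the rank of every word (1 = most frequent).
ranks <- t(apply(-counts, 1, rank, ties.method = "first"))
ranks
```

Here top_n has one column per date and one row per position, and ranks has one row per date and one column per word, so either form can be checked against the raw counts by eye.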