[英](Fast) word frequency matrix in R
I am writing an R program that involves analyzing a large amount of unstructured text data and creating a word-frequency matrix. 我正在编写一个R程序,其中涉及分析大量非结构化文本数据并创建一个词频矩阵。 I've been using the wfm
and wfdf
functions from the qdap
package, but have noticed that this is a bit slow for my needs. 我一直在使用qdap
软件包中的wfm
和wfdf
函数,但是注意到这对于我的需求来说有点慢。 It appears that the production of the word-frequency matrix is the bottleneck. 看来字频矩阵的产生是瓶颈。
The code for my function is as follows. 我的函数的代码如下。
library(qdap)
liwcr <- function(inputText, dict) {
if(!file.exists(dict))
stop("Dictionary file does not exist.")
# Read in dictionary categories
# Start by figuring out where the category list begins and ends
dictionaryText <- readLines(dict)
if(!length(grep("%", dictionaryText))==2)
stop("Dictionary is not properly formatted. Make sure category list is correctly partitioned (using '%').")
catStart <- grep("%", dictionaryText)[1]
catStop <- grep("%", dictionaryText)[2]
dictLength <- length(dictionaryText)
dictionaryCategories <- read.table(dict, header=F, sep="\t", skip=catStart, nrows=(catStop-2))
wordCount <- word_count(inputText)
outputFrame <- dictionaryCategories
outputFrame["count"] <- 0
# Now read in dictionary words
no_col <- max(count.fields(dict, sep = "\t"), na.rm=T)
dictionaryWords <- read.table(dict, header=F, sep="\t", skip=catStop, nrows=(dictLength-catStop), fill=TRUE, quote="\"", col.names=1:no_col)
workingMatrix <- wfdf(inputText)
for (i in workingMatrix[,1]) {
if (i %in% dictionaryWords[, 1]) {
occurrences <- 0
foundWord <- dictionaryWords[dictionaryWords$X1 == i,]
foundCategories <- foundWord[1,2:no_col]
for (w in foundCategories) {
if (!is.na(w) & (!w=="")) {
existingCount <- outputFrame[outputFrame$V1 == w,]$count
outputFrame[outputFrame$V1 == w,]$count <- existingCount + workingMatrix[workingMatrix$Words == i,]$all
}
}
}
}
return(outputFrame)
}
I realize the for loop is inefficient, so in an effort to locate the bottleneck, I tested it without this portion of the code (simply reading in each text file and producing the word-frequency matrix), and seen very little in the way of speed improvements. 我意识到for循环效率低下,因此为了找到瓶颈,我在没有代码的这一部分的情况下对其进行了测试(只需读取每个文本文件并生成字频矩阵),并且在处理过程中很少看到速度提高。 Example: 例:
library(qdap)
fn <- reports::folder(delete_me)
n <- 10000
lapply(1:n, function(i) {
out <- paste(sample(key.syl[[1]], 30, T), collapse = " ")
cat(out, file=file.path(fn, sprintf("tweet%s.txt", i)))
})
filename <- sprintf("tweet%s.txt", 1:n)
for(i in 1:length(filename)){
print(filename[i])
text <- readLines(paste0("/toshi/twitter_en/", filename[i]))
freq <- wfm(text)
}
The input files are Twitter and Facebook status postings. 输入文件是Twitter和Facebook状态发布。
Is there any way to improve the speed for this code? 有什么方法可以提高此代码的速度吗?
EDIT2: Due to institutional restrictions, I can't post any of the raw data. EDIT2:由于机构限制,我无法发布任何原始数据。 However, just to give an idea of what I'm dealing with: 25k text files, each with all the available tweets from an individual Twitter user. 但是,仅提供有关我要处理的内容的想法:25k文本文件,每个文件都包含来自单个Twitter用户的所有可用推文。 There are also an additional 100k files with Facebook status updates, structured in the same way. 还有另外10万个具有Facebook状态更新的文件,其结构相同。
Here is a qdap
approach and a mixed qdap/tm
approach that is faster. 这是qdap
方法和更快的qdap/tm
混合方法。 I provide the code and then the timings on each. 我提供了代码,然后提供了每个代码的时间安排。 Basically I read everything in at once and operator on the entire data set. 基本上,我一次读取所有内容,然后对整个数据集进行运算符读取。 You could then split it back apart if you wanted with split
. 然后,你可以回去开,如果你想用它拆分split
。
A MWE that you should provide with questions 您应该提出问题的MWE
library(qdap)
fn <- reports::folder(delete_me)
n <- 10000
lapply(1:n, function(i) {
out <- paste(sample(key.syl[[1]], 30, T), collapse = " ")
cat(out, file=file.path(fn, sprintf("tweet%s.txt", i)))
})
filename <- sprintf("tweet%s.txt", 1:n)
The qdap approach qdap方法
tic <- Sys.time() ## time it
dat <- list2df(setNames(lapply(filename, function(x){
readLines(file.path(fn, x))
}), tools::file_path_sans_ext(filename)), "text", "tweet")
difftime(Sys.time(), tic) ## time to read in
the_wfm <- with(dat, wfm(text, tweet))
difftime(Sys.time(), tic) ## time to make wfm
Timing qdap approach 定时QDAP方法
> tic <- Sys.time() ## time it
>
> dat <- list2df(setNames(lapply(filename, function(x){
+ readLines(file.path(fn, x))
+ }), tools::file_path_sans_ext(filename)), "text", "tweet")
There were 50 or more warnings (use warnings() to see the first 50)
>
> difftime(Sys.time(), tic) ## time to read in
Time difference of 2.97617 secs
>
> the_wfm <- with(dat, wfm(text, tweet))
>
> difftime(Sys.time(), tic) ## time to make wfm
Time difference of 48.9238 secs
The qdap-tm combined approach qdap-tm组合方法
tic <- Sys.time() ## time it
dat <- list2df(setNames(lapply(filename, function(x){
readLines(file.path(fn, x))
}), tools::file_path_sans_ext(filename)), "text", "tweet")
difftime(Sys.time(), tic) ## time to read in
tweet_corpus <- with(dat, as.Corpus(text, tweet))
tdm <- tm::TermDocumentMatrix(tweet_corpus,
control = list(removePunctuation = TRUE,
stopwords = FALSE))
difftime(Sys.time(), tic) ## time to make TermDocumentMatrix
Timing qdap-tm combined approach 定时QDAP-TM组合方法
> tic <- Sys.time() ## time it
>
> dat <- list2df(setNames(lapply(filename, function(x){
+ readLines(file.path(fn, x))
+ }), tools::file_path_sans_ext(filename)), "text", "tweet")
There were 50 or more warnings (use warnings() to see the first 50)
>
> difftime(Sys.time(), tic) ## time to read in
Time difference of 3.108177 secs
>
>
> tweet_corpus <- with(dat, as.Corpus(text, tweet))
>
> tdm <- tm::TermDocumentMatrix(tweet_corpus,
+ control = list(removePunctuation = TRUE,
+ stopwords = FALSE))
>
> difftime(Sys.time(), tic) ## time to make TermDocumentMatrix
Time difference of 13.52377 secs
There is a qdap-tm Package Compatibility (-CLICK HERE-) to help users move between qdap and tm. 有一个qdap-tm软件包兼容性(-CLICK HERE-)可帮助用户在qdap和tm之间切换。 As you can see on 10000 tweets the combined approach is ~3.5 x faster. 如您在10000条推文上看到的,组合方法快了约3.5倍。 A purely tm
approach may be faster still. 单纯的tm
方法可能仍然更快。 Also if you want the wfm
use as.wfm(tdm)
to coerce the TermDocumentMatrix
. 此外,如果你想要的wfm
使用as.wfm(tdm)
要挟TermDocumentMatrix
。
Your code though is slower either way because it's not the R way to do things. 但是您的代码无论哪种方式都较慢,因为它不是R的处理方式。 I'd recommend reading some additional info on R to get better at writing faster code. 我建议阅读一些有关R的其他信息,以更好地编写更快的代码。 I'm currently working through Hadley Wickham's Advanced R that I'd recommend. 我目前正在研究建议的Hadley Wickham的AdvancedR 。
声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.