Problems using large custom stopword lists in tm package (R)
I'm sure many of you have seen this before:
Warnmeldung:
In mclapply(content(x), FUN, ...) :
all scheduled cores encountered errors in user code
This time the error appears when I try to remove a custom stopword list from my corpus:
asdf <- tm_map(asdf, removeWords, mystops)
It works with small stopword lists (I have tried up to around 100 words), but my current stopword list contains about 42,000 words.
I tried this:
asdf <- tm_map(asdf, removeWords, mystops, lazy=T)
This does not throw an error itself, but every tm_map command after it does, and I get the same error when I try to build a DTM from the corpus:
Fehler in UseMethod("meta", x) :
nicht anwendbare Methode für 'meta' auf Objekt der Klasse "try-error" angewendet
Zusätzlich: Warnmeldung:
In mclapply(unname(content(x)), termFreq, control) :
all scheduled cores encountered errors in user code
I am considering a function that loops the removeWords call over small chunks of my list, but I would also like to understand why the length of the list is a problem in the first place.
Here is my sessionInfo():
sessionInfo()
R version 3.3.2 (2016-10-31)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X El Capitan 10.11.6
locale:
[1] de_DE.UTF-8/de_DE.UTF-8/de_DE.UTF-8/C/de_DE.UTF-8/de_DE.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] SnowballC_0.5.1 wordcloud_2.5 RColorBrewer_1.1-2 RTextTools_1.4.2 SparseM_1.74 topicmodels_0.2-4 tm_0.6-2
[8] NLP_0.1-9
loaded via a namespace (and not attached):
[1] Rcpp_0.12.7 splines_3.3.2 MASS_7.3-45 tau_0.0-18 prodlim_1.5.7 lattice_0.20-34 foreach_1.4.3
[8] tools_3.3.2 caTools_1.17.1 nnet_7.3-12 parallel_3.3.2 grid_3.3.2 ipred_0.9-5 glmnet_2.0-5
[15] e1071_1.6-7 iterators_1.0.8 modeltools_0.2-21 class_7.3-14 survival_2.39-5 randomForest_4.6-12 Matrix_1.2-7.1
[22] lava_1.4.5 bitops_1.0-6 codetools_0.2-15 maxent_1.3.3.1 rpart_4.1-10 slam_0.1-38 stats4_3.3.2
[29] tree_1.0-37
Edit:
I am using 20news-bydate.tar.gz and only the train folder.
I won't share all the preprocessing I am doing, because it includes a morphological analysis of the whole corpus (not done in R).
Here is my R code:
library(tm)
library(topicmodels)
library(SnowballC)
asdf <- Corpus(DirSource("/path/to/20news-bydate/train",encoding="UTF-8"),readerControl=list(language="en"))
asdf <- tm_map(asdf, content_transformer(tolower))
asdf <- tm_map(asdf, removeWords, stopwords(kind="english"))
asdf <- tm_map(asdf, removePunctuation)
asdf <- tm_map(asdf, removeNumbers)
asdf <- tm_map(asdf, stripWhitespace)
# until here: preprocessing
# building DocumentTermMatrix with term frequency
dtm <- DocumentTermMatrix(asdf, control=list(weighting=weightTf))
# building a matrix from the DTM and wordvector (all words as titles,
# sorted by frequency in corpus) and wordlengths (length of actual
# wordstrings in the wordvector)
m <- as.matrix(dtm)
wordvector <- sort(colSums(m),decreasing=T)
wordlengths <- nchar(names(wordvector))
names(wordvector[wordlengths>22]) -> mystops1 # all words longer than 22 characters
names(wordvector)[wordvector<3] -> mystops2 # all words with occurence <3
mystops <- c(mystops1,mystops2) # the stopwordlist
# going back to the corpus to remove the chosen words
asdf <- tm_map(asdf, removeWords, mystops)
This is where I get the error.
As I suspected in my comment: removeWords in the tm package uses a Perl regular expression. All words are joined with the | (or) pipe. In your case the resulting pattern string simply has too many characters:
gsub(regex, "", txt, perl = TRUE) 中的錯誤:無效的正則表達式 '(*UCP)\\b(zxmkrstudservzdvunituebingende|zxmkrstudservzdvunituebingende|...|unwantingly| 另外:警告消息:在 gsub(regex, "" , txt, perl = TRUE) : PCRE 模式編譯錯誤 'regular expression is too large' at ''
One solution: define your own removeWords function that splits the oversized regex at a character limit and then applies each partial regex separately, so the limit is never reached:
f <- content_transformer(function(txt, words, n = 30000L) {
  # cumulative pattern length: each word plus one "|" separator
  l <- cumsum(nchar(words) + c(0, rep(1, length(words) - 1)))
  # assign the words to groups of at most n pattern characters
  groups <- cut(l, breaks = seq(1, ceiling(tail(l, 1)/n)*n + 1, by = n))
  # build one word-boundary regex per group
  regexes <- sapply(split(words, groups), function(words)
    sprintf("(*UCP)\\b(%s)\\b", paste(sort(words, decreasing = TRUE), collapse = "|")))
  # apply the smaller regexes one after another
  for (regex in regexes) txt <- gsub(regex, "", txt, perl = TRUE)
  return(txt)
})
asdf <- tm_map(asdf, f, mystops)
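To sanity-check the chunked removal, you can rebuild the DTM afterwards and confirm that none of the removed terms survived (a quick check using the objects from the question):
dtm2 <- DocumentTermMatrix(asdf, control = list(weighting = weightTf))
any(Terms(dtm2) %in% mystops)  # should be FALSE if the removal worked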
Your custom stopword list is too large, so you have to break it up:
group <- 100                                   # words per chunk
n <- length(mystops)                           # 'mystops' from the question
r <- rep(1:ceiling(n/group), each = group)[1:n]
d <- split(mystops, r)                         # list of chunks of 100 words each
for (i in 1:length(d)) {
  asdf <- tm_map(asdf, removeWords, d[[i]])    # remove one chunk at a time
}
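Splitting into chunks of 100 words keeps each pattern that removeWords builds far below the PCRE size limit, but it also means one pass over the corpus per chunk (roughly 420 passes for 42,000 words); the split-by-pattern-length approach above needs far fewer passes and will typically be faster.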