R quanteda library, error in corpus creation

I have a curious error that only happens in my colleagues' RStudio when they run the code. The code deals with a text corpus, and this is what I do:

ap.corpus <- corpus(raw.data$text)
ap.corpus
# Corpus consisting of 214,226 documents and 0 docvars.
ap.corpus <- Corpus(VectorSource(ap.corpus))
ap.corpus <- tm_map(ap.corpus, tolower)
ap.corpus <- corpus(ap.corpus)

The last step is just reformatting before I get to the model. I run this code smoothly with no issues. My two colleagues, on the other hand, run exactly the same code on exactly the same data and get the following error after ap.corpus <- corpus(ap.corpus): nrow(docvars) == length(x) is not TRUE

We tried restarting RStudio and running on a smaller corpus (only 500 documents), and still got the same error. I am hoping someone else has experienced a similar error. This does not appear to be a code issue, as I have never seen such an error when running this or similar code in my RStudio. Note: my colleague also ran the code in plain R, avoiding RStudio. Same issue.

This is impossible to verify without a reproducible example, but I have created one here since this might have been a bug. Based on my attempt to reproduce the reported error, however, I don't think that it is.

This sort of question would be better filed as an issue at the quanteda GitHub issues site rather than as an SO question. But it is good to address here, since I will also show you a way to avoid the use of tm (even though your example does not specify that, it is clear you are using some of its functions).

library("quanteda")
## quanteda version 0.99.22
## Using 7 of 8 threads for parallel computing

ap.corpus <- corpus(LETTERS[1:10])
ap.corpus
## Corpus consisting of 10 documents and 0 docvars.
texts(ap.corpus)
## text1  text2  text3  text4  text5  text6  text7  text8  text9 text10 
##   "A"    "B"    "C"    "D"    "E"    "F"    "G"    "H"    "I"    "J" 

ap.corpus <- tm::Corpus(tm::VectorSource(ap.corpus))
## <<SimpleCorpus>>
## Metadata:  corpus specific: 1, document level (indexed): 0
## Content:  documents: 10
ap.corpus <- tm::tm_map(ap.corpus, tolower)

corpus(ap.corpus)
## Corpus consisting of 10 documents and 0 docvars.
corpus(ap.corpus) %>% texts()
## text1  text2  text3  text4  text5  text6  text7  text8  text9 text10 
##   "a"    "b"    "c"    "d"    "e"    "f"    "g"    "h"    "i"    "j" 

So that all appears to work just fine.
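Because the example runs cleanly under the quanteda version shown above, one plausible but unconfirmed explanation for an error that appears only on other machines is a difference in installed package versions. A minimal sketch for comparing environments, using only base R utilities (the version-mismatch hypothesis is an assumption, not something established in this thread):

```r
# Run this on each machine and compare the printed output.
# Assumption: a version mismatch is only a hypothesis, not a confirmed cause.
pkgs <- c("quanteda", "tm")

# TRUE/FALSE per package: can its namespace be loaded at all?
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)

# Version string for each installed package, NA for missing ones
versions <- vapply(pkgs, function(p) {
  if (installed[[p]]) as.character(utils::packageVersion(p)) else NA_character_
}, character(1))

print(versions)
print(R.version.string)
```

If the versions differ between machines, aligning them before re-running the original code would be a cheap first check.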

However, there is no need to use tm for this. You could have done the following in quanteda:

ap.corpus2 <- corpus(LETTERS[1:10])
texts(ap.corpus2) <- char_tolower(texts(ap.corpus2))
texts(ap.corpus2)
## text1  text2  text3  text4  text5  text6  text7  text8  text9 text10 
##   "a"    "b"    "c"    "d"    "e"    "f"    "g"    "h"    "i"    "j" 

However, we discourage you from modifying your corpus directly, since this is a destructive change that will mean that you cannot recover the cased version of your texts, should you wish to use these for other purposes.

It is much better to use a workflow such as:

corpus(c("A B C", "C D E")) %>%
    tokens() %>%
    tokens_tolower()

## tokens from 2 documents.
## text1 :
## [1] "a" "b" "c"
## 
## text2 :
## [1] "c" "d" "e"
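The question mentions reformatting "before I get to the model" but does not say what the model consumes. As a sketch, assuming the next step needs a document-feature matrix, the tokens workflow above could be carried one step further with dfm(); the object names txt, toks and ap.dfm are illustrative only:

```r
library("quanteda")

# Toy documents standing in for raw.data$text (hypothetical data)
txt <- c("A B C", "C D E")

# Lowercase at the token level; the corpus itself keeps its original case
toks <- tokens_tolower(tokens(corpus(txt)))

# Build a document-feature matrix from the lowercased tokens
ap.dfm <- dfm(toks)

# Inspect the resulting features
featnames(ap.dfm)
```

Since the lowercasing happens on the tokens, the cased corpus remains available for any other purpose.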

