
Clean corpus using Quanteda

What's the Quanteda way of cleaning a corpus like shown in the example below using tm (lowercase, remove punctuation, remove numbers, stem words)? To be clear, I don't want to create a document-feature matrix with dfm(); I just want a clean corpus that I can use for a specific downstream task.

# This is what I want to do in quanteda
library("tm")
data("crude")   # sample corpus of 20 Reuters news articles
crude <- tm_map(crude, content_transformer(tolower))  # lowercase
crude <- tm_map(crude, removePunctuation)             # remove punctuation
crude <- tm_map(crude, removeNumbers)                 # remove numbers
crude <- tm_map(crude, stemDocument)                  # stem words

PS: I am aware that I could just do quanteda_corpus <- quanteda::corpus(crude) to get what I want, but I would much prefer being able to do everything in Quanteda.

I think what you want to do is deliberately impossible in quanteda.

You can, of course, do the cleaning quite easily without losing the order of words by using the tokens* set of functions:

library("tm")
data("crude")
library("quanteda")
toks <- corpus(crude) %>%                                 # convert the tm corpus to quanteda
  tokens(remove_punct = TRUE, remove_numbers = TRUE) %>%  # tokenise; drop punctuation and numbers
  tokens_wordstem()                                       # stem each token

print(toks, max_ndoc = 3)
#> Tokens consisting of 20 documents and 15 docvars.
#> reut-00001.xml :
#>  [1] "Diamond"  "Shamrock" "Corp"     "said"     "that"     "effect"  
#>  [7] "today"    "it"       "had"      "cut"      "it"       "contract"
#> [ ... and 78 more ]
#> 
#> reut-00002.xml :
#>  [1] "OPEC"    "may"     "be"      "forc"    "to"      "meet"    "befor"  
#>  [8] "a"       "schedul" "June"    "session" "to"     
#> [ ... and 427 more ]
#> 
#> reut-00004.xml :
#>  [1] "Texaco"   "Canada"   "said"     "it"       "lower"    "the"     
#>  [7] "contract" "price"    "it"       "will"     "pay"      "for"     
#> [ ... and 40 more ]
#> 
#> [ reached max_ndoc ... 17 more documents ]

But it is not possible to turn this tokens object back into a corpus. It would, however, be possible to write a new function to do this:

# S3 method so that corpus() also dispatches on tokens objects.
# Note: this relies on quanteda's unexported internals (:::), which may change.
corpus.tokens <- function(x, ...) {
  quanteda:::build_corpus(
    unlist(lapply(x, paste, collapse = " ")),  # collapse each document's tokens back into one string
    docvars = cbind(quanteda:::make_docvars(length(x), docnames(x)), docvars(x))
  )
}

corp <- corpus(toks)
print(corp, max_ndoc = 3)
#> Corpus consisting of 20 documents and 15 docvars.
#> reut-00001.xml :
#> "Diamond Shamrock Corp said that effect today it had cut it c..."
#> 
#> reut-00002.xml :
#> "OPEC may be forc to meet befor a schedul June session to rea..."
#> 
#> reut-00004.xml :
#> "Texaco Canada said it lower the contract price it will pay f..."
#> 
#> [ reached max_ndoc ... 17 more documents ]

But this object, while technically being a corpus class object, is not what a corpus is supposed to be. From ?corpus [emphasis added]:

Value

A corpus class object containing the original texts, document-level variables, document-level metadata, corpus-level metadata, and default settings for subsequent processing of the corpus.

The object above does not meet this description, as the original texts have already been processed, yet the class of the object communicates otherwise. I don't see a reason to break this logic, as all subsequent analysis steps should be possible using either tokens* or dfm* functions.
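To illustrate that last point, here is a short sketch (assuming the toks object built above) of running typical downstream steps directly on the tokens object, without ever converting it back to a corpus:

```r
library("quanteda")

# Document-feature matrix built straight from the cleaned tokens;
# this is the usual entry point for frequency analysis, scaling, etc.
dfmat <- dfm(toks)

# Most frequent stemmed features across the 20 documents
topfeatures(dfmat, 10)

# Keyword-in-context lookups also work on the tokens object itself
kwic(toks, pattern = "oil", window = 3)
```

In other words, the cleaned tokens object is already the right input for quanteda's analysis functions, which is why a "cleaned corpus" class is not needed.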
