How can I bootstrap text readability statistics using quanteda?

I'm new to both bootstrapping and the quanteda package for text analysis. I have a large corpus of texts, organized by document group type, that I'd like to obtain readability scores for. I can easily obtain readability scores for each group with the following function:

textstat_readability(texts(mwe, groups = "document"), "Flesch")

I then want to bootstrap the results to obtain a 95% confidence interval by wrapping a function:

b_readability <- function(x, i, groups = NULL, measure = "Flesch")
textstat_readability(texts(x[i], groups = groups), measure) 
n <- 10

groups <- factor(mwe[["document"]]$document)  
b <- boot(texts(mwe), b_readability, strata = groups, R = n, groups = groups) 
colnames(b$t) <- names(b$t0)
apply(b$t, 2, quantile, c(.025, .5, .975)) 

But the "b <-" line fails with the error: "Error in t.star[r, ] <- res[[r]] : incorrect number of subscripts on matrix"

I've wasted two days trying to debug this with no luck. What am I doing wrong? Any advice would be much appreciated...

MWE:

mwe <- structure(list(
  document = structure(c(1L, 1L), .Label = c("a", "b", "c", "d", "e"), class = "factor"),
  text = c("Text 1. Text 1.1", "Text 2."),
  section = structure(2:1, .Label = c("aa", "bb", "cc", "dd", "ee", "ff", "hh", "ii", "jj", "kk"), class = "factor"),
  year = c(1919L, 1944L),
  preamble = structure(8:9, .Label = c("1", "2", "3", " 4 ", "5", "6  ", "7  ", "8  ", "9  ", "10 "), class = "factor"),
  articles = c(43L, 47L),
  pages = c(5.218, 7.666),
  wordcount = c(3503L, 4929L),
  mean_articles = c(45, 45)
), row.names = 1:2, class = "data.frame")

mwe <- corpus(mwe)

b_readability <- function(x, i, groups = NULL, measure = "Flesch")
textstat_readability(texts(x[i], groups = groups), measure) 
n <- 10

groups <- factor(mwe[["document"]]$document)  
b <- boot(texts(mwe), b_readability, strata = groups, R = n, groups = groups) 
colnames(b$t) <- names(b$t0)
apply(b$t, 2, quantile, c(.025, .5, .975)) 

A good question that involves knowing a lot about the boot package as well as how to index and group corpus texts in quanteda. Here's the best (currently) and safest way to do it. "Safest" here means future-proof, since some things that currently work in the internal addressing of a quanteda corpus will not work in the upcoming v2. (We warn about this very clearly in ?corpus but no one seems to heed that warning...) Note also that while this should always work, we are also planning, in future versions, more direct methods for bootstrapping textual statistics that would not require the user to do this sort of deep dive into the boot package.

Let's try a reproducible example from built-in objects first. To "bootstrap" a text, we will construct a new, hypothetical text using sentence-level resampling (with replacement) from the original, and use texts(x, groups = "<groupvar>") to piece this together into a hypothetical kind of text. (This is how I have done it in the two references at the end of this post.) To make this happen, we can exploit the property of texts() that it gets texts from a corpus object but also works on character objects (with fast grouping).
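As a rough illustration of what sentence-level resampling with replacement means, here is a minimal base-R sketch that does not use quanteda at all. The sentences, the doc factor, and the doc_a/doc_b labels are made-up toy data, and paste() stands in for the regrouping that texts(x, groups = ...) performs in the answer.

```r
# Hypothetical toy data: five sentences belonging to two original documents
sentences <- c("The cat sat.", "It was warm.", "Rain fell all day.",
               "We stayed in.", "Nobody minded.")
doc <- factor(c("doc_a", "doc_a", "doc_b", "doc_b", "doc_b"))

set.seed(1)  # make the resampling reproducible

# Resample sentence positions with replacement, separately within each document
idx <- unlist(lapply(split(seq_along(sentences), doc), function(pos) {
  pos[sample.int(length(pos), replace = TRUE)]
}), use.names = FALSE)

# Paste the resampled sentences back into one hypothetical text per document
resampled <- tapply(sentences[idx], doc[idx], paste, collapse = " ")
print(resampled)
```

Each hypothetical document keeps its original number of sentences, but some sentences may appear more than once and others not at all. This within-document resampling is what the strata argument to boot() arranges further down.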

To get the sentences, after subsetting the corpus to simplify our example here, we reshape it into sentences.

First, however, we record the original document's name in a new document variable, so that we can use this for grouping later. In this example, we could also have used Year, but doing it this way will work for any example. (There are some internal records about the original docname that we might have used, but doing it this way will be future-proof.)

library("quanteda")
## Package version: 1.4.1
library("boot")

docvars(data_corpus_inaugural, "docnameorig") <- docnames(data_corpus_inaugural)
sent_corpus <- data_corpus_inaugural %>%
  corpus_subset(Year > 2000) %>%
  corpus_reshape(to = "sentences")

Then we have to define the function to be bootstrapped. We will use the "index" method and call the index i (as you did above). Here, x will be a character and not a corpus, even though we will call texts() on it, again using the grouping variable to reassemble it. The function also needs to return a vector and not a data.frame, which is the normal form of a textstat_*() return. So we will extract just the measure column and return it as a vector.

b_readability <- function(x, i, groups = NULL, measure = "Flesch") {
  textstat_readability(texts(x[i], groups = groups[i]), measure)[[measure]]
}

We will call our grouping variable simgroups just to distinguish the value from the argument name, and use this for both the groups argument and for strata in the call to boot(). strata is an argument to boot(), while groups is passed through to our function b_readability(). We need to convert this grouping variable to a factor, since the function seems to want that. Then we call boot() and get our answer.

simgroups <- factor(docvars(sent_corpus, "docnameorig"))

boot(texts(sent_corpus), b_readability, R = 10, 
     strata = simgroups, groups = simgroups)
## 
## STRATIFIED BOOTSTRAP
## 
## 
## Call:
## boot(data = texts(sent_corpus), statistic = b_readability, R = 10, 
##     strata = simgroups, groups = simgroups)
## 
## 
## Bootstrap Statistics :
##     original      bias    std. error
## t1* 60.22723 -0.01454477    2.457416
## t2* 53.23332  1.24942328    2.564719
## t3* 60.56705  1.07426297    1.996705
## t4* 53.55532 -0.28971190    1.943986
## t5* 58.63471  0.52289051    2.502101

These correspond to the five (original) documents, here distinguished by year, although unfortunately those names have been replaced by t1, t2, ... in the return object from boot().
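One possible way to restore the names and get the 95% interval asked about in the question is to label the columns of the replicate matrix and take quantiles, as the question's own last two lines attempted. The sketch below uses made-up numeric scores in place of readability values so that it runs with only the boot package. The scores, grp, and group_means names are illustrative assumptions, not part of quanteda.

```r
library(boot)

set.seed(123)  # make the resampling reproducible

# Hypothetical toy data: one numeric score per "sentence", in two groups
scores <- c(rnorm(30, mean = 60, sd = 5), rnorm(30, mean = 50, sd = 5))
grp <- factor(rep(c("doc_a", "doc_b"), each = 30))

# Statistic in boot's "index" form: returns one value per group as a plain vector.
# Note that groups is indexed by i as well, so it stays aligned with x[i].
group_means <- function(x, i, groups) {
  as.numeric(tapply(x[i], groups[i], mean))
}

b <- boot(scores, group_means, R = 200, strata = grp, groups = grp)

# boot() drops the group names, so put them back on the matrix of replicates
colnames(b$t) <- levels(grp)

# Percentile summary of the replicates: 2.5%, 50% and 97.5% for each group
ci <- apply(b$t, 2, quantile, probs = c(0.025, 0.5, 0.975))
print(ci)
```

With the quanteda example above, the same two final steps would presumably apply with levels(simgroups) as the column names, provided the readability values come back in factor-level order. R = 10 is enough for a demonstration, but percentile intervals usually call for many more replicates.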

To return to your original example, let's say these form two documents from one stratum (since they are too short to subdivide further). Then:

simgroups <- rep(1, ndoc(mwe))
boot(texts(mwe), b_readability, R = 10, strata = simgroups, groups = simgroups)
## 
## STRATIFIED BOOTSTRAP
## 
## 
## Call:
## boot(data = texts(mwe), statistic = b_readability, R = 10, strata = simgroups, 
##     groups = simgroups)
## 
## 
## Bootstrap Statistics :
##     original    bias    std. error
## t1*   119.19 0.6428333   0.4902916
