
Trouble accessing quanteda corpus quantities in version >= 2

I am having a problem running a script I wrote earlier. Back then, when I applied quanteda::corpus to a readtext object, it returned an object of class "corpus" and "list". Now the same script returns an object of class "corpus" and "character", and this breaks the subsequent code. What could be the reason for this, and how can I fix it?

Here is the script:

txt <- readtext("C:/Users/aerol/Desktop/txt_sample")
corpus_txt <- corpus(txt) %>%
  corpus_reshape(to = "sentences")

docvars(corpus_txt, "Treaty") <- corpus_txt$documents$`_document`
docvars(corpus_txt, "Year") <- as.integer(stri_sub(corpus_txt$documents$`_document`, -9, -6))

The files are international treaties. All the filenames follow the same format: they contain the name of the treaty and the year it was signed, and I was extracting both.

Back then, the class of corpus_txt was "corpus" "list":

> class(corpus_txt)
[1] "corpus" "list"  

But now:

> class(corpus_txt)
[1] "corpus"    "character"
> packageVersion("quanteda")
[1] ‘2.1.2’

And I cannot extract information from the corpus the way I did before. Since I have been working on this since last October, I should have been using the same version all along.

Many thanks in advance.

We changed the corpus internal structure in v2, after two years of warning in the documentation that users should not access the corpus internals directly, or their code would likely not work under future major versions.

From https://github.com/quanteda/quanteda/blob/master/NEWS.md#quanteda-20:

quanteda 2.0 introduces some major changes, detailed here.

  1. New corpus object structure.

    The internals of the corpus object have been redesigned, and now are based around a character vector with meta- and system-data in attributes. These are all updated to work with the existing extractor and replacement functions. If you were using these before, then you should not even notice the change. Docvars are now handled separately from the texts, in the same way that docvars are handled for tokens objects.

From ?corpus:

For quanteda >= 2.0, this is a specially classed character vector. It has many additional attributes but you should not access these attributes directly, especially if you are another package author. Use the extractor and replacement functions instead, or else your code is not only going to be uglier, but also likely to break should the internal structure of a corpus object change. Using the accessor and replacement functions ensures that future code to manipulate corpus objects will continue to work.
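To make the distinction concrete, here is a minimal sketch of reading a corpus through the accessor functions rather than through its internals. The toy corpus and its document names (doc1, doc2) are made up for illustration and are not from the question.

```r
library(quanteda)

# Toy corpus; the names of the character vector become the document names
corp <- corpus(c(doc1 = "A short text.", doc2 = "Another short text."))

class(corp)         # under quanteda >= 2.0 this reports "character", not "list"
docnames(corp)      # document names
docvars(corp)       # document-level variables, returned as a data.frame
ndoc(corp)          # number of documents
as.character(corp)  # the texts themselves
```

None of these calls depends on how the object is stored, which is why they should keep working across the v1 to v2 change.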

Solution? Use docnames(corpus_txt).
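A possible way to adapt the script from the question is sketched below. It rests on some assumptions: the named character vector stands in for the readtext() output, the file names are hypothetical and chosen so that the original stri_sub(-9, -6) offsets land on the year, and the docvars are set before reshaping. As far as I know, corpus_reshape() appends a sentence index to each document name, so running stri_sub() on the reshaped docnames with the same offsets would no longer hit the year.

```r
library(quanteda)
library(stringi)

# Hypothetical stand-in for readtext("..."); names play the role of file names
txt <- c("Example Treaty (1950).txt" = "First sentence. Second sentence.",
         "Another Treaty (1987).txt" = "Only one sentence here.")

corp <- corpus(txt)

# Use the docnames() accessor instead of corpus_txt$documents$`_document`,
# and do it before reshaping so the names are still the plain file names
docvars(corp, "Treaty") <- docnames(corp)
docvars(corp, "Year") <- as.integer(stri_sub(docnames(corp), -9, -6))

# Each sentence inherits the docvars of the document it came from
corpus_txt <- corpus_reshape(corp, to = "sentences")

docvars(corpus_txt)
```

If your real file names are laid out differently, the stri_sub() offsets will need adjusting.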
