
Create empty Corpus in textacy

I want to create an empty corpus in textacy and later fill it with data via

corpus.add(doc)

But every time I create an empty corpus, I am not able to save it and get this error instead:

IndexError: list index out of range

I tried both passing no data when creating the corpus and passing None as data:

corpus = textacy.Corpus(lang=locale)             # attempt 1: no data argument
corpus = textacy.Corpus(lang=locale, data=None)  # attempt 2: data=None
corpus.save(path)  # this line results in the IndexError

It would be nice if anybody could help me :)

I have just tried this out myself. What is locale exactly? I did the following:

  1. Created a spaCy language object for German with

nlp = spacy.load("de_core_news_lg")

  2. and then passed it to

corpus = textacy.Corpus(nlp)

After that I was able to iterate through my documents and add them one by one.

However, I would not recommend doing this. I ran two scenarios to process 15k short comments:

  • I first preprocessed my documents into a list and passed it directly to textacy.Corpus(nlp, data=preprocessed_list). That took around 22 s.
  • Performing the same logic, but creating an empty corpus and adding the items to it one by one, took 1 min 26 s (86 s, roughly four times as long).


 