

Is it possible to merge a list of spacy tokens into a doc

I have a document which I've tokenized using the spaCy tokenizer. I want to apply NER on a sequence of tokens (a section of this document).

Currently I'm creating a doc first and then applying NER:

import spacy

nlp = spacy.load("en_core_web_sm")
# tokens_list is a list of spaCy tokens
words = [tok.text for tok in tokens_list]
spaces = [bool(tok.whitespace_) for tok in tokens_list]

doc = spacy.tokens.doc.Doc(nlp.vocab, words=words, spaces=spaces)
doc = nlp.get_pipe("ner")(doc)

But this is not ideal because I lose their original IDs within the document, which is important.

Is there a way to merge tokens into a doc and still maintain their IDs (including other future extensions)?

To merge a list of tokens back into a Doc, you may wish to try:

import spacy

nlp = spacy.load("en_core_web_sm")
txt = "This is some text"
doc = nlp(txt)

# Rebuild a new Doc from the tokens' texts and trailing-whitespace flags
words = [tok.text for tok in doc]
spaces = [bool(tok.whitespace_) for tok in doc]
doc2 = spacy.tokens.doc.Doc(nlp.vocab, words=words, spaces=spaces)
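
As a minimal sketch (not part of the original answer) of how the reconstructed Doc can be tied back to token positions in the source document: the example sentence, the slice boundaries, and the offset bookkeeping below are assumptions for illustration. Note that on spaCy v3 pipelines whose ner component listens to a shared tok2vec, you may need to run the whole pipeline on the new Doc (doc2 = nlp(doc2)) rather than the single component.

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying a U.K. startup for $1 billion.")

# Pretend this slice is the section of the document we want to run NER on
tokens_list = list(doc[3:12])

words = [tok.text for tok in tokens_list]
spaces = [bool(tok.whitespace_) for tok in tokens_list]
doc2 = spacy.tokens.doc.Doc(nlp.vocab, words=words, spaces=spaces)

# Apply the NER component, as in the question; with a v3 pipeline whose
# ner listens to a shared tok2vec, use doc2 = nlp(doc2) instead
doc2 = nlp.get_pipe("ner")(doc2)

# Token i in doc2 corresponds to token i + offset in the original doc,
# so entity spans can be mapped back to their original positions
offset = tokens_list[0].i
for ent in doc2.ents:
    print(ent.text, ent.label_, ent.start + offset, ent.end + offset)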
