

Is it possible to merge a list of spacy tokens into a doc

I have a document which I've tokenized using the spaCy tokenizer. I want to apply NER on a sequence of tokens (a section of this document).

Currently I'm creating a doc first and then applying NER:

import spacy

nlp = spacy.load("en_core_web_sm")
# tokens_list is a list of spaCy tokens
words = [tok.text for tok in tokens_list]
spaces = [bool(tok.whitespace_) for tok in tokens_list]

doc = spacy.tokens.doc.Doc(nlp.vocab, words=words, spaces=spaces)
doc = nlp.get_pipe("ner")(doc)

But this is not ideal because I lose their original IDs within the document, which is important.

Is there a way to merge tokens into a doc and still maintain their IDs (including other future extensions)?

To merge a list of tokens back into a Doc, you may wish to try:

import spacy

nlp = spacy.load("en_core_web_sm")
txt = "This is some text"
doc = nlp(txt)

# Rebuild a new Doc from the tokens' texts and trailing-whitespace flags
words = [tok.text for tok in doc]
spaces = [bool(tok.whitespace_) for tok in doc]
doc2 = spacy.tokens.doc.Doc(nlp.vocab, words=words, spaces=spaces)
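
As a minimal sketch (not part of the original answer) of how the reconstructed Doc can be tied back to token positions in the source document: the example sentence, the slice boundaries, and the offset bookkeeping below are assumptions for illustration. Note that on spaCy v3 pipelines whose ner component listens to a shared tok2vec, you may need to run the whole pipeline on the new Doc (doc2 = nlp(doc2)) rather than the single component.

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying a U.K. startup for $1 billion.")

# Pretend this slice is the section of the document we want to run NER on
tokens_list = list(doc[3:12])

words = [tok.text for tok in tokens_list]
spaces = [bool(tok.whitespace_) for tok in tokens_list]
doc2 = spacy.tokens.doc.Doc(nlp.vocab, words=words, spaces=spaces)

# Apply the NER component, as in the question; with a v3 pipeline whose
# ner listens to a shared tok2vec, use doc2 = nlp(doc2) instead
doc2 = nlp.get_pipe("ner")(doc2)

# Token i in doc2 corresponds to token i + offset in the original doc,
# so entity spans can be mapped back to their original positions
offset = tokens_list[0].i
for ent in doc2.ents:
    print(ent.text, ent.label_, ent.start + offset, ent.end + offset)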
