Is it possible to merge a list of spacy tokens into a doc
I have a document which I've tokenized using the spaCy tokenizer. I want to apply ner to a sequence of tokens (a section of this document).
Currently I'm creating a doc first and then applying ner:
nlp = spacy.load("en_core_web_sm")
# tokens_list is a list of spaCy tokens
words = [tok.text for tok in tokens_list]
spaces = [bool(tok.whitespace_) for tok in tokens_list]
doc = spacy.tokens.Doc(nlp.vocab, words=words, spaces=spaces)
doc = nlp.get_pipe("ner")(doc)
But this is not ideal because I lose their original ids within the document, which is important.
Is there a way to merge tokens into a doc and still maintain their ids (including other future extensions)?
To merge a list of tokens back into a Doc you may wish to try:
import spacy
nlp = spacy.load("en_core_web_sm")
txt = "This is some text"
doc = nlp(txt)
words = [tok.text for tok in doc]
spaces = [bool(tok.whitespace_) for tok in doc]
doc2 = spacy.tokens.Doc(nlp.vocab, words=words, spaces=spaces)
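If the concern is losing each token's original position, one option is to carry the index along explicitly. The sketch below registers a custom token extension (the name `orig_i` is hypothetical, not a spaCy built-in) and copies each original token's `.i` onto the corresponding token of the new Doc. It uses `spacy.blank("en")` so it runs without a model download; with `en_core_web_sm` loaded you could then pass `doc2` through `nlp.get_pipe("ner")` exactly as above.

```python
import spacy
from spacy.tokens import Doc, Token

# Hypothetical extension "orig_i" to hold a token's index in the original doc
Token.set_extension("orig_i", default=None, force=True)

nlp = spacy.blank("en")  # swap in spacy.load("en_core_web_sm") to actually run ner
doc = nlp("This is some text about London")

section = doc[2:]  # a Span: some section of the document
words = [tok.text for tok in section]
spaces = [bool(tok.whitespace_) for tok in section]
doc2 = Doc(nlp.vocab, words=words, spaces=spaces)

# Copy the original indices onto the new tokens
for new_tok, old_tok in zip(doc2, section):
    new_tok._.orig_i = old_tok.i

print([(t.text, t._.orig_i) for t in doc2])
```

After NER runs on `doc2`, each entity token can still be mapped back to its position in the original document via `tok._.orig_i`. The same pattern works for any other per-token attribute you need to preserve.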