
Is it possible to merge a list of spaCy tokens into a Doc?

I have a document that I've tokenized using the spaCy tokenizer. I want to apply NER to a sequence of tokens (a section of this document).

Currently I'm creating a Doc first and then applying NER:

nlp = spacy.load("en_core_web_sm")

# tokens_list is a list of spaCy tokens
words = [tok.text for tok in tokens_list]
spaces = [bool(tok.whitespace_) for tok in tokens_list]

doc = spacy.tokens.Doc(nlp.vocab, words=words, spaces=spaces)
doc = nlp.get_pipe("ner")(doc)

But this is not ideal, because I lose the tokens' original indices within the document, which is important.

Is there a way to merge tokens into a Doc and still maintain their indices (including other future extensions)?

To merge a list of tokens back into a Doc, you may wish to try:

import spacy
nlp = spacy.load("en_core_web_sm")
txt = "This is some text"
doc = nlp(txt)
words = [tok.text for tok in doc]

spaces = [bool(tok.whitespace_) for tok in doc]
doc2 = spacy.tokens.Doc(nlp.vocab, words=words, spaces=spaces)
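
The original document's token indices are not carried over automatically, though. One way to keep them (a minimal sketch, using a hypothetical orig_i custom token extension that is not part of spaCy itself) is to record each token's original index on the new Doc before running the pipeline components over it:

import spacy
from spacy.tokens import Doc, Token

nlp = spacy.load("en_core_web_sm")

# Hypothetical custom extension to hold each token's index in the source document
if not Token.has_extension("orig_i"):
    Token.set_extension("orig_i", default=None)

source_doc = nlp("This is some text about Google in London.")
tokens_list = source_doc[4:]  # the section you want to run NER on

words = [tok.text for tok in tokens_list]
spaces = [bool(tok.whitespace_) for tok in tokens_list]
doc2 = Doc(nlp.vocab, words=words, spaces=spaces)

# Copy the original indices onto the new tokens
for new_tok, old_tok in zip(doc2, tokens_list):
    new_tok._.orig_i = old_tok.i

# Run the pipeline components over the pre-tokenized Doc; depending on your
# spaCy version, the ner component may need tok2vec output, so running the
# whole pipeline is the safer option
for name, proc in nlp.pipeline:
    doc2 = proc(doc2)

for ent in doc2.ents:
    # Map each entity back to token positions in the original document
    print(ent.text, ent.label_, [tok._.orig_i for tok in ent])

Each predicted entity can then be mapped back to token offsets in the original document via the stored orig_i values.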
