
Is it possible to merge a list of spaCy tokens into a Doc?

I have a document that I've tokenized using the spaCy tokenizer. I want to apply NER to a sequence of tokens (a section of this document).

Currently I'm creating a Doc first and then applying NER:

nlp = spacy.load("en_core_web_sm")

# tokens_list is a list of spaCy tokens
words = [tok.text for tok in tokens_list]
spaces = [bool(tok.whitespace_) for tok in tokens_list]

doc = spacy.tokens.Doc(nlp.vocab, words=words, spaces=spaces)
doc = nlp.get_pipe("ner")(doc)

But this is not ideal, because I lose the tokens' original indices within the document, which is important.

Is there a way to merge tokens into a Doc and still maintain their indices (including other future extensions)?

To merge a list of tokens back into a Doc, you may wish to try:

import spacy
nlp = spacy.load("en_core_web_sm")
txt = "This is some text"
doc = nlp(txt)
words = [tok.text for tok in doc]

spaces = [bool(tok.whitespace_) for tok in doc]
doc2 = spacy.tokens.Doc(nlp.vocab, words=words, spaces=spaces)
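
The original document's token indices are not carried over automatically, though. One way to keep them (a minimal sketch, using a hypothetical orig_i custom token extension that is not part of spaCy itself) is to record each token's original index on the new Doc before running the pipeline components over it:

import spacy
from spacy.tokens import Doc, Token

nlp = spacy.load("en_core_web_sm")

# Hypothetical custom extension to hold each token's index in the source document
if not Token.has_extension("orig_i"):
    Token.set_extension("orig_i", default=None)

source_doc = nlp("This is some text about Google in London.")
tokens_list = source_doc[4:]  # the section you want to run NER on

words = [tok.text for tok in tokens_list]
spaces = [bool(tok.whitespace_) for tok in tokens_list]
doc2 = Doc(nlp.vocab, words=words, spaces=spaces)

# Copy the original indices onto the new tokens
for new_tok, old_tok in zip(doc2, tokens_list):
    new_tok._.orig_i = old_tok.i

# Run the pipeline components over the pre-tokenized Doc; depending on your
# spaCy version, the ner component may need tok2vec output, so running the
# whole pipeline is the safer option
for name, proc in nlp.pipeline:
    doc2 = proc(doc2)

for ent in doc2.ents:
    # Map each entity back to token positions in the original document
    print(ent.text, ent.label_, [tok._.orig_i for tok in ent])

Each predicted entity can then be mapped back to token offsets in the original document via the stored orig_i values.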
