
Creating doc objects efficiently with a spacy-stanza model

According to the spaCy developers, the most efficient way to create Doc objects from a list of texts is the following:

docs = list(nlp.pipe(texts))

where:

nlp : the loaded spaCy pipeline
texts : a list of texts that we want to convert to Doc objects
docs : a list of Doc objects derived from the list texts

However, when I use this code with a spacy-stanza language model I get an error message:

AssertionError: If neither 'pretokenized' or 'no_ssplit' option is enabled, the input to the TokenizerProcessor must be a string.

What would be your advice?

The stanza library doesn't have a good solution for batching, so nlp.pipe() with a stanza model isn't going to help performance the way it would with a native spacy model.
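In practice, the simplest workaround is to process each text individually. A minimal sketch, using spacy.blank("en") as a stand-in so it runs without downloading a stanza model; with spacy-stanza you would load your pipeline via the spacy_stanza package instead:

```python
import spacy

# Stand-in pipeline for this sketch; substitute your spacy-stanza
# pipeline for real use.
nlp = spacy.blank("en")

texts = ["First text.", "Second text."]

# The stanza tokenizer expects a single string per call, so iterate
# over the list instead of passing the whole list to nlp().
docs = [nlp(text) for text in texts]

print([doc.text for doc in docs])
```

This gives correct results; it just forgoes the batching speed-up that nlp.pipe() provides with native spaCy models.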

The stanza developers' only suggestion is to join the texts with "\n\n", process them as one text, and then split the predictions back into individual docs afterwards. On the spacy side of things, span.as_doc() might be helpful if you can identify the token indices where each document starts and ends, so:

span = huge_doc[start_token_index:end_token_index]
single_doc = span.as_doc()

Be aware that if the start/end indices fall in the middle of a parse rather than at a sentence boundary, span.as_doc() will adjust the original analysis so that the resulting single doc has a valid parse.
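Putting the pieces together, here is a sketch of the join-and-split approach. It uses spacy.blank("en") so it runs without a stanza model, and it assumes the tokenizer keeps the "\n\n" separator as its own whitespace token, which is how spaCy's tokenizer behaves; stanza may tokenize the separator differently, so verify the boundary detection against your actual pipeline:

```python
import spacy

# Stand-in pipeline for this sketch; substitute your spacy-stanza
# pipeline for real use.
nlp = spacy.blank("en")

texts = ["First document here.", "Second document here."]

# Join the texts with the separator the stanza developers suggest,
# then process everything as one string.
SEP = "\n\n"
huge_doc = nlp(SEP.join(texts))

# Split the big doc back into individual docs: whenever we hit a
# whitespace token containing the separator, close off the current
# span and start a new one after it.
docs = []
start = 0
for i, token in enumerate(huge_doc):
    if token.is_space and SEP in token.text:
        if i > start:
            docs.append(huge_doc[start:i].as_doc())
        start = i + 1
docs.append(huge_doc[start:].as_doc())

print([doc.text for doc in docs])
```

Because each span here starts and ends at a document boundary rather than mid-parse, span.as_doc() does not need to adjust any analysis.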
