
How to use spacy train to add entities to an existing custom NER model? (spaCy v3.0)

I am currently implementing a custom NER model interface where a user can interact with a frontend application to add custom entities for training a spaCy model.

I want to use spacy train (CLI) to take an existing model (a custom NER model) and add the keyword and entity specified by the user to that model, instead of training the whole model again. I can't find this anywhere in the documentation.

For example, let's say I have a model that is already trained for a custom entity FOOD (Pizza, Pasta, Bread, etc.). Now I want to take this existing model and train it for a new entity called DRINKS, with keywords like Coca-Cola, Pepsi, Juice, etc., using the spacy train command in spaCy v3.0.

The spacy train command that I am using currently is as follows:

> python -m spacy train config.cfg --output ./output --paths.train ./train.spacy --paths.dev ./train.spacy

I load the model for prediction using:

> nlp1 = spacy.load(R".\output\model-best")

So far, I have been training the model for new entities manually. Below is the code that finds the keywords in my training data and outputs the training data in the old (text, annotations) tuple format.

import re

# Keywords to look for, and the entity label for each keyword (same order)
keyword = ["outages", "updates", "negative star", "worst"]
entity = ["PROBLEM", "PROBLEM", "COMPLAINT", "COMPLAINT"]

train = []

# df is a pandas DataFrame with a "text" column (not shown here)
for text in df.text:
    text = str(text)
    for kw, label in zip(keyword, entity):
        # re.escape makes sure the keyword is matched literally,
        # even if it contains regex metacharacters
        for m in re.finditer(re.escape(kw), text):
            # One training example per match: (text, {"entities": [(start, end, label)]})
            train.append((text, {"entities": [(m.start(), m.end(), label)]}))

train

After this, I converted the training data from that format into the .spacy format with the code below.

import spacy
from tqdm import tqdm
from spacy.tokens import DocBin

# Assumption: the original snippet does not show how nlp was created;
# a blank English pipeline is enough for tokenization here
nlp = spacy.blank("en")

db = DocBin() # create a DocBin object

for text, annot in tqdm(train): # data in previous format
    doc = nlp.make_doc(text) # create doc object from text
    ents = []
    for start, end, label in annot["entities"]: # add character indexes
        span = doc.char_span(start, end, label=label, alignment_mode="contract")
        if span is None:
            # the character offsets do not line up with token boundaries
            print("Skipping entity")
        else:
            ents.append(span)
    doc.ents = ents # label the text with the ents
    db.add(doc)

db.to_disk("./train.spacy")

> I want to use spacy train (CLI) to take an existing model (custom NER model) and add the keyword and entity specified by the user, to that model. (Instead of training the whole model again). I can't find this anywhere in the documentation.

What you are describing is called "online learning" (updating an already trained model incrementally as new examples arrive), and the default spaCy models don't support it. Most modern neural NER methods, even outside of spaCy, have no support for it at all.

You cannot fix this by using a custom training loop.

Your options are to use rule-based matching, which can only match things that are explicitly in a list, or to retrain models on the fly.

Rule-based matching should be easy to set up but has the obvious issue that it can't learn things not explicitly in the list.
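As a rough illustration of the rule-based option, here is a minimal sketch using spaCy's entity ruler component. It assumes a blank English pipeline as a stand-in for the existing model, and the user_entities dictionary is a hypothetical placeholder for whatever the frontend collects from the user.

```python
import spacy

# Assumption: a blank English pipeline stands in for the existing custom model.
nlp = spacy.blank("en")

# The entity ruler matches exact phrases (or token patterns) and labels them.
ruler = nlp.add_pipe("entity_ruler")

# Hypothetical user input: entity label -> list of keywords.
user_entities = {"DRINKS": ["Coca-Cola", "Pepsi", "Juice"]}

# String patterns are matched as exact, case-sensitive phrases.
patterns = [
    {"label": label, "pattern": keyword}
    for label, keywords in user_entities.items()
    for keyword in keywords
]
ruler.add_patterns(patterns)

doc = nlp("I had a Pepsi with my pizza.")
found = [(ent.text, ent.label_) for ent in doc.ents]
print(found)
```

With an already trained pipeline, one would presumably load it with spacy.load and add the ruler next to the existing "ner" component, for example with nlp.add_pipe("entity_ruler", after="ner"). By default the ruler does not overwrite entities that are already set, so the trained FOOD predictions would be kept.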

Training things on the fly may sound like it'll take too long, but you can train a small model quite quickly. What you can do is train a small model for a small number of iterations while the user is working interactively, and after they've confirmed the model is more or less working correctly you can use the same training data for a larger model with longer training.
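One way to sketch the quick-then-long training idea is to build the same spacy train command as in the question twice, overriding training.max_steps from the command line. The helper name build_train_command, the step counts, the output directories, and config_large.cfg are all hypothetical. The only spaCy feature relied on is that config values can be overridden with --section.key arguments.

```python
import shlex


def build_train_command(config_path, output_dir, train_path, dev_path, max_steps):
    """Return the argument list for a `spacy train` run (hypothetical helper)."""
    return [
        "python", "-m", "spacy", "train", config_path,
        "--output", output_dir,
        "--paths.train", train_path,
        "--paths.dev", dev_path,
        # Override [training] max_steps from the config for this run only.
        "--training.max_steps", str(max_steps),
    ]


# Quick interactive run: same config as in the question, few steps (number is a guess).
quick_cmd = build_train_command(
    "config.cfg", "./output_quick", "./train.spacy", "./train.spacy", 200
)

# Longer run on the same data once the user is happy; config_large.cfg is hypothetical.
full_cmd = build_train_command(
    "config_large.cfg", "./output", "./train.spacy", "./train.spacy", 20000
)

print(shlex.join(quick_cmd))
print(shlex.join(full_cmd))
```

The lists can be passed to subprocess.run(cmd, check=True) to actually start training. Building them separately from running them keeps the quick and the long run on identical data paths, so only the config and the step budget differ.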
