
Adding more custom entities to a pretrained custom NER model in spaCy 3

I have a huge amount of textual data and want to add around 50 different entities. When I first started working with it, I was getting a memory error. As I understand it, spaCy can handle roughly 100,000 tokens per GB of memory, up to a maximum of about 1,000,000. So I split my dataset into 5 chunks and used an annotator to create a JSON file for each chunk. I started with one JSON file and successfully trained a model. Now I want to add more data to it, so that I don't miss any tags and a good variety of data is used during training. Please guide me on how to proceed.

I mentioned some points of confusion in a comment, but assuming that your issue is how to load a large training set into spaCy, the solution is pretty simple.

First, save your training data as multiple .spacy files in one directory. You do not have to make JSON files; that was the standard in v2. For details, see the training data section of the docs. In your config you can point the training data source at this directory, and spaCy will use all the files in it.
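As a minimal sketch of that first step, each chunk of annotated data can be converted to a `DocBin` and written to its own .spacy file in a shared directory (the directory name, example text, and entity offsets below are illustrative):

```python
from pathlib import Path

import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")  # only the tokenizer is needed to build training docs

# Illustrative annotated chunk: (text, list of (start_char, end_char, label))
chunk = [
    ("Apple is opening a store in Mumbai", [(0, 5, "ORG"), (28, 34, "GPE")]),
]

db = DocBin()
for text, ents in chunk:
    doc = nlp.make_doc(text)
    spans = [doc.char_span(start, end, label=label) for start, end, label in ents]
    doc.ents = [s for s in spans if s is not None]  # skip misaligned spans
    db.add(doc)

out_dir = Path("./train")  # hypothetical corpus directory
out_dir.mkdir(exist_ok=True)
db.to_disk(out_dir / "part1.spacy")  # repeat per chunk: part2.spacy, ...
```

With one such file per chunk, the config's `[paths] train` entry can then point at the `./train` directory rather than a single file.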

Next, to avoid keeping all the training data in memory, you can specify max_epochs = -1 (see the docs on streaming corpora). Using this feature means you will have to specify your labels ahead of time, as covered in the same docs. You will probably also want to shuffle your training data manually, since a streamed corpus is not shuffled for you.
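Sketched as a config fragment, the relevant pieces look roughly like this (the paths are placeholders; the labels file is the kind produced by `python -m spacy init labels`):

```ini
[paths]
train = "corpus/train"          # directory of .spacy files

[training]
max_epochs = -1                 # stream the corpus instead of loading it fully

[initialize.components.ner.labels]
@readers = "spacy.read_labels.v1"
path = "corpus/labels/ner.json" # labels declared ahead of time
```

With `max_epochs = -1`, spaCy treats the corpus as an endless stream, which is why the NER labels cannot be inferred from a full pass over the data and must be supplied up front.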

That's all you need to train with a lot of data.

The title of your question mentions adding entities to the pretrained model. It's usually better to train from scratch instead, to avoid catastrophic forgetting, but you can see a guide to doing it here.
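If you do go the route of extending an existing NER component, the core API calls are `add_label` and `update`. Here is a hedged sketch on a blank pipeline (the label, example text, and offsets are made up; with a loaded pretrained model you would call `nlp.resume_training()` instead of `nlp.initialize()`):

```python
import spacy
from spacy.training import Example

nlp = spacy.blank("en")
ner = nlp.add_pipe("ner")
ner.add_label("GADGET")  # hypothetical new entity type

# Illustrative training example with character offsets for the new label
train_data = [
    ("I bought a Phoneblaster yesterday", {"entities": [(11, 23, "GADGET")]}),
]
examples = [
    Example.from_dict(nlp.make_doc(text), ann) for text, ann in train_data
]

nlp.initialize(lambda: examples)  # use nlp.resume_training() on a pretrained model
for _ in range(5):
    nlp.update(examples)  # a few update steps on the new data
```

The catastrophic-forgetting risk the answer mentions comes from updating only on the new label: without mixing in examples of the original entity types, the model's performance on them tends to degrade.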
