
Problem adding custom entities to SpaCy's NER

  • I added a new entity called "orgName" to en_core_web_lg using https://spacy.io/usage/training#example-new-entity-type
  • All my training data (26k sentences) have the "orgName" labeled in them.
  • To deal with the catastrophic forgetting problem, I ran en_core_web_lg on those 26k raw sentences and added the ORG, PROD, FAC, etc. entities it predicted as labels. To avoid colliding entities, I created duplicates: for a sentence A labeled with "orgName", I created a duplicate A2 labeled with ORG, PROD, FAC, etc., ending up with about 52k sentences.
  • I trained using 100 iterations.

Now, the problem is that when I test the model, even on the training sentences, it doesn't show ORG, PROD, FAC, etc., only "orgName".

Where do you think the problem is?

In principle the way you're trying to solve the catastrophic forgetting problem, by retraining it on its old predictions, seems like a good approach to me.

However, if you have duplicate versions of the same sentence that are annotated differently and feed them to the NER classifier, you may confuse the model. The reason is that it doesn't just look at the positive examples, but also explicitly treats non-annotated words as negative cases.

So if you have "Bob lives in London", and you only annotate "London", then it will think Bob is surely not an NE. If then you have a second sentence where you annotate only Bob, it will "unlearn" that London is an NE, because now it's not annotated as such. So consistency really is important.
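To check how much of a training set is affected, a small script can list the texts that appear more than once with different entity sets. This is only a sketch: it assumes the data is a list of `(text, {"entities": [(start, end, label)]})` tuples, and `TRAIN_DATA` and `find_inconsistent_texts` are hypothetical names made up for illustration.

```python
from collections import defaultdict

# Hypothetical training data: (text, {"entities": [(start, end, label)]})
TRAIN_DATA = [
    ("Bob lives in London", {"entities": [(13, 19, "GPE")]}),
    ("Bob lives in London", {"entities": [(0, 3, "PERSON")]}),
    ("Alice works at Acme", {"entities": [(15, 19, "orgName")]}),
]


def find_inconsistent_texts(train_data):
    """Return texts that occur more than once with different entity sets."""
    variants_by_text = defaultdict(set)
    for text, annotations in train_data:
        # frozenset makes each annotation set hashable and order-independent
        variants_by_text[text].add(frozenset(annotations["entities"]))
    return sorted(
        text for text, variants in variants_by_text.items() if len(variants) > 1
    )


conflicts = find_inconsistent_texts(TRAIN_DATA)
print(conflicts)
```

Every text this reports is one where the model is being told both "this token is an entity" and "this token is not an entity".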

I would suggest implementing a more advanced algorithm to resolve the conflicts, so that each sentence appears once with a single, consistent set of annotations. One option is to always take the annotated entity with the longest Span. But if the Spans are often exactly the same, you may need to reconsider your label scheme. Which entities collide most often? I would assume ORG and orgName? Do you really need ORG? Perhaps the two can be "merged" into the same entity?
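A minimal sketch of the longest-span idea, working on plain `(start, end, label)` character offsets rather than spaCy objects. The function names and the example sentence are hypothetical, and letting the custom label win on exact ties is an assumption of this sketch, not something spaCy prescribes. If you are working with `Span` objects instead, `spacy.util.filter_spans` applies a similar longest-first rule.

```python
def resolve_longest(spans):
    """Keep non-overlapping (start, end, label) spans, preferring longer ones."""
    # Sort longest first, then by start offset. sorted() is stable, so spans
    # of equal length and start keep their input order.
    ordered = sorted(spans, key=lambda s: (-(s[1] - s[0]), s[0]))
    kept = []
    for start, end, label in ordered:
        # Accept a span only if it does not overlap anything already kept
        if all(end <= k_start or start >= k_end for k_start, k_end, _ in kept):
            kept.append((start, end, label))
    return sorted(kept)


def merge_annotations(custom, pretrained):
    """Merge two annotation lists into one consistent list per sentence.

    Assumption: custom spans are listed first, so they win exact ties.
    """
    return resolve_longest(list(custom) + list(pretrained))


text = "Acme Corp opened an office in London"
custom = [(0, 9, "orgName")]                     # hand-labeled
pretrained = [(0, 4, "ORG"), (30, 36, "GPE")]    # e.g. model predictions
merged = merge_annotations(custom, pretrained)
print(merged)
```

Here the shorter ORG span is dropped because it overlaps the longer orgName span, while the GPE span is kept, so the sentence is trained on once with both labels.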
