
How to convert combined spacy ner tags to BIO format?

How can I convert these annotations into BIO format? I have tried spaCy's biluo_tags_from_offsets, but it fails to catch all the entities, and I think I know why.

tags = biluo_tags_from_offsets(doc, annot['entities'])

BSc(Bachelor of science): these two entities are written together with no space between them, but spaCy splits the text only where there is a space. The words therefore come out roughly as (BSc(Bachelor, of, science), which is why biluo_tags_from_offsets fails and returns -

When it checks (80, 83, 'Degree'), it cannot find BSc as a word on its own. It fails in the same way for (84, 103, 'Degree').

How can I handle cases like these? Any help is appreciated.


EDUCATION: · Master of Computer Applications (MCA) from NV, *********, *****. · BSc(Bachelor of science) from NV, *********, *****

{'entities': [(13, 44, 'Degree'), (46, 49, 'Degree'), (80, 83, 'Degree'), (84, 103, 'Degree')]}

In general, you pass your data to biluo_tags_from_offsets(doc, entities), where entities is a list of (start_char, end_char, label) tuples such as [(14, 44, 'ORG'), (51, 54, 'ORG')]. You can edit this list as you wish (you can also start from doc.ents and work from there). You can add, remove, or combine entities in the list, as in the example below:

import spacy
# spaCy v2 import path; in spaCy v3 this helper is spacy.training.offsets_to_biluo_tags
from spacy.gold import biluo_tags_from_offsets

nlp = spacy.load("en_core_web_md")

text = "I have a BSc (Bachelors of Computer Sciences) from NYU"
doc = nlp(text)
print("Entities before adding new entity:", doc.ents)

# Collect the model's entities as (start_char, end_char, label) tuples
entities = [(ent.start_char, ent.end_char, ent.label_) for ent in doc.ents]
print("BILUO before adding new entity:", biluo_tags_from_offsets(doc, entities))

# Add a desired entity: characters 9 to 12 cover "BSc"
entities.append((9, 12, "ORG"))

print("BILUO after adding new entity:", biluo_tags_from_offsets(doc, entities))

Entities before adding new entity: (Bachelors of Computer Sciences, NYU)
BILUO before adding new entity: ['O', 'O', 'O', 'O', 'O', 'B-ORG', 'I-ORG', 'I-ORG', 'L-ORG', 'O', 'O', 'U-ORG']
BILUO after adding new entity: ['O', 'O', 'O', 'U-ORG', 'O', 'B-ORG', 'I-ORG', 'I-ORG', 'L-ORG', 'O', 'O', 'U-ORG']
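
The tags above are BILUO, while the question asks for BIO. The two schemes differ only in the prefixes: L- (last token) becomes I-, and U- (single-token entity) becomes B-. A minimal sketch of that conversion follows; biluo_to_bio is a hypothetical helper name, not part of spaCy, and it passes through the "-" tag that spaCy uses for tokens it could not align.

```python
def biluo_to_bio(tags):
    """Convert a list of BILUO tags to BIO tags."""
    prefix_map = {"L": "I", "U": "B"}  # B and I keep their prefix
    bio = []
    for tag in tags:
        # "O" is outside any entity; "-" marks a misaligned token
        if tag in ("O", "-"):
            bio.append(tag)
            continue
        prefix, label = tag.split("-", 1)
        bio.append(prefix_map.get(prefix, prefix) + "-" + label)
    return bio


# The BILUO output printed above, after adding the new entity
biluo = ['O', 'O', 'O', 'U-ORG', 'O', 'B-ORG', 'I-ORG', 'I-ORG',
         'L-ORG', 'O', 'O', 'U-ORG']
bio = biluo_to_bio(biluo)
print(bio)
```

With the list above, the three U-ORG and L-ORG tags become B-ORG and I-ORG respectively, and every other tag is unchanged.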

If you want the merging of entities to be rule-based, you can try the EntityRuler, as in the following simplified example (taken from the spaCy EntityRuler documentation):

from spacy.lang.en import English
from spacy.pipeline import EntityRuler

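nlp = English()
# spaCy v2 API; in spaCy v3 the equivalent is ruler = nlp.add_pipe("entity_ruler")
ruler = EntityRuler(nlp)
# A pattern is either an exact string or a list of per-token attribute dicts
patterns = [{"label": "ORG", "pattern": "Apple"},
            {"label": "GPE", "pattern": [{"LOWER": "san"}, {"LOWER": "francisco"}]}]
ruler.add_patterns(patterns)
nlp.add_pipe(ruler)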

doc = nlp("Apple is opening its first big office in San Francisco.")
print([(ent.text, ent.label_) for ent in doc.ents])

Then pass the list of redefined (in your case, merged) entities to biluo_tags_from_offsets again, as in the first code snippet.
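
The failure described in the question comes down to token boundaries: an entity can be tagged only if it starts and ends exactly where tokens start and end. The sketch below illustrates this in plain Python, without spaCy. The function bio_tags_from_offsets and both regex tokenizers are assumptions made for illustration; they are not spaCy's tokenizer or API. Once "(" is split off as its own token, both Degree spans line up.

```python
import re


def bio_tags_from_offsets(text, entities, pattern=r"\w+|[^\w\s]"):
    """Tag regex tokens with BIO labels from (start, end, label) offsets."""
    # Stand-in tokenizer (an assumption): words, or single punctuation marks
    tokens = [(m.start(), m.end(), m.group())
              for m in re.finditer(pattern, text)]
    tags = ["O"] * len(tokens)
    for start, end, label in entities:
        # Tokens lying fully inside the entity span
        inside = [i for i, (s, e, _) in enumerate(tokens)
                  if s >= start and e <= end]
        # The span aligns only if those tokens begin and end on its boundaries
        if (not inside or tokens[inside[0]][0] != start
                or tokens[inside[-1]][1] != end):
            raise ValueError(f"span {(start, end, label)} is not token-aligned")
        tags[inside[0]] = "B-" + label
        for i in inside[1:]:
            tags[i] = "I-" + label
    return [(tok, tag) for (_, _, tok), tag in zip(tokens, tags)]


# A fragment of the question's text, with offsets recomputed for the fragment
text = "BSc(Bachelor of science) from NV"
entities = [(0, 3, "Degree"), (4, 23, "Degree")]

tagged = bio_tags_from_offsets(text, entities)
for token, tag in tagged:
    print(token, tag)
```

Calling the same function with pattern=r"\S+" (whitespace-only splitting) raises ValueError, because "BSc(Bachelor" stays a single token and neither span aligns with it. In spaCy, the comparable adjustment would presumably be to extend the tokenizer's infix patterns so that "(" is split off inside a word; that step is not shown here.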
