How to convert combined spaCy NER tags to BIO format?

How can I convert this into BIO format? I have tried using spaCy's biluo_tags_from_offsets, but it fails to catch all entities, and I think I know why.

tags = biluo_tags_from_offsets(doc, annot['entities'])

BSc(Bachelor of science): these two are joined together, but spaCy splits the text only where there is a space. So the tokens come out as (BSc(Bachelor, of, science), which is why biluo_tags_from_offsets fails and returns -

Now, when it checks for (80, 83, 'Degree'), it cannot find the word BSc on its own. Similarly, it fails again for (84, 103, 'Degree').

How can I fix these scenarios? Please help if anyone can.

EDUCATION: · Master of Computer Applications (MCA) from NV, *********, *****. · BSc(Bachelor of science) from NV, *********, *****

{'entities': [(13, 44, 'Degree'), (46, 49, 'Degree'), (80, 83, 'Degree'), (84, 103, 'Degree')]}
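The misalignment described above can be reproduced without spaCy. The sketch below is only an approximation: it assumes plain whitespace tokenization (spaCy's real tokenizer also splits off some punctuation, such as a trailing parenthesis), and misaligned_spans is a hypothetical helper, not a spaCy function. It reports every annotated span whose start or end does not fall on a token boundary.

```python
import re

def misaligned_spans(text, entities):
    """Return entity spans whose start or end is not a whitespace-token boundary."""
    # Character offsets at which whitespace-delimited tokens begin and end.
    starts = {m.start() for m in re.finditer(r"\S+", text)}
    ends = {m.end() for m in re.finditer(r"\S+", text)}
    return [(s, e, label) for s, e, label in entities
            if s not in starts or e not in ends]

# Shortened version of the problematic fragment; offsets are computed, not hard-coded.
text = "BSc(Bachelor of science) from NV"
bachelor = text.index("Bachelor")
entities = [
    (0, len("BSc"), "Degree"),
    (bachelor, bachelor + len("Bachelor of science"), "Degree"),
]
print(misaligned_spans(text, entities))
```

Both spans are reported here, because "BSc(Bachelor" is a single whitespace token and neither annotation starts and ends on its edges.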

In general, you pass your data into biluo_tags_from_offsets(doc, entities), where entities looks like [(14, 44, 'ORG'), (51, 54, 'ORG')]. You can edit this parameter as you wish (you can also start from doc.ents and proceed from there). You may add, remove, or combine any entities in this list, as in the example below:

import spacy
# spaCy v2 import; in spaCy v3 the equivalent is spacy.training.offsets_to_biluo_tags
from spacy.gold import biluo_tags_from_offsets
nlp = spacy.load("en_core_web_md")

text = "I have a BSc (Bachelors of Computer Sciences) from NYU"
doc = nlp(text)
print("Entities before adding new entity:", doc.ents)

entities = []
for ent in doc.ents:
    entities.append((ent.start_char, ent.end_char, ent.label_))
print("BILUO before adding new entity:", biluo_tags_from_offsets(doc, entities))

entities.append((9,12,'ORG')) # add a desired entity

print("BILUO after adding new entity:", biluo_tags_from_offsets(doc, entities))

Entities before adding new entity: (Bachelors of Computer Sciences, NYU)
BILUO before adding new entity: ['O', 'O', 'O', 'O', 'O', 'B-ORG', 'I-ORG', 'I-ORG', 'L-ORG', 'O', 'O', 'U-ORG']
BILUO after adding new entity: ['O', 'O', 'O', 'U-ORG', 'O', 'B-ORG', 'I-ORG', 'I-ORG', 'L-ORG', 'O', 'O', 'U-ORG']
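Since the question asks for BIO rather than BILUO, the tags above still need one more step. A minimal sketch, assuming the usual mapping (L- becomes I-, U- becomes B-, everything else is unchanged); biluo_to_bio is a hypothetical helper name, not part of spaCy:

```python
def biluo_to_bio(tags):
    """Map BILUO tags to BIO: L- becomes I-, U- becomes B-, the rest is kept."""
    prefix_map = {"L": "I", "U": "B"}
    bio = []
    for tag in tags:
        prefix, sep, label = tag.partition("-")
        if sep and label and prefix in prefix_map:
            bio.append(prefix_map[prefix] + "-" + label)
        else:
            # 'O', 'B-...', 'I-...' and the bare '-' placeholder pass through untouched.
            bio.append(tag)
    return bio

biluo = ['O', 'O', 'O', 'U-ORG', 'O', 'B-ORG', 'I-ORG', 'I-ORG', 'L-ORG', 'O', 'O', 'U-ORG']
print(biluo_to_bio(biluo))
```

A bare '-' tag is deliberately left as-is, so any spans that could not be aligned remain visible after the conversion.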

If you want the merging of entities to be rule-based, you can try EntityRuler with the following simplified example (taken from the spaCy documentation):

from spacy.lang.en import English
from spacy.pipeline import EntityRuler

nlp = English()
ruler = EntityRuler(nlp)
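patterns = [{"label": "ORG", "pattern": "Apple"},
            {"label": "GPE", "pattern": [{"LOWER": "san"}, {"LOWER": "francisco"}]}]
ruler.add_patterns(patterns)  # register the patterns with the ruler
nlp.add_pipe(ruler)           # spaCy v2 style; without this step doc.ents stays empty
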
doc = nlp("Apple is opening its first big office in San Francisco.")
print([(ent.text, ent.label_) for ent in doc.ents])

Then pass the list of redefined (in your case, merged) entities to biluo_tags_from_offsets again, as in the first code snippet.
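For the original failure, where an entity boundary falls inside a token such as BSc(Bachelor, another option is to tokenize so that punctuation is split off before aligning. The sketch below is plain Python and not spaCy's implementation: tokenize and bio_tags_from_offsets are hypothetical helpers, and the regex tokenizer (every punctuation mark becomes its own token) is an assumption that may not match your model's tokenizer.

```python
import re

def tokenize(text):
    """Return (token, start, end) triples; each punctuation mark is its own token."""
    return [(m.group(), m.start(), m.end())
            for m in re.finditer(r"\w+|[^\w\s]", text)]

def bio_tags_from_offsets(text, entities):
    """Assign BIO tags to tokens that lie fully inside an entity's character span."""
    tokens = tokenize(text)
    tags = ["O"] * len(tokens)
    for ent_start, ent_end, label in entities:
        inside = [i for i, (_, start, end) in enumerate(tokens)
                  if start >= ent_start and end <= ent_end]
        for position, i in enumerate(inside):
            tags[i] = ("B-" if position == 0 else "I-") + label
    return [(token, tag) for (token, _, _), tag in zip(tokens, tags)]

# Shortened version of the problematic fragment; offsets are computed, not hard-coded.
text = "BSc(Bachelor of science) from NV"
bachelor = text.index("Bachelor")
entities = [
    (0, len("BSc"), "Degree"),
    (bachelor, bachelor + len("Bachelor of science"), "Degree"),
]
for token, tag in bio_tags_from_offsets(text, entities):
    print(token, tag)
```

With this tokenization BSc and ( are separate tokens, so both Degree spans align. Tokens that only partly overlap an entity are left as O, which you may want to flag instead of ignoring.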
