How do I convert combined spaCy NER tags to BIO format?

Question

How can I convert this to BIO format? I tried spaCy's biluo_tags_from_offsets, but it fails to capture all the entities, and I think I know why.

tags = biluo_tags_from_offsets(doc, annot['entities'])

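BSc(Bachelor of science): these two are joined together, but spaCy splits the text where there is a space. So the tokens end up as (BSc(Bachelor, of, science), which is why biluo_tags_from_offsets fails and returns -

Now, when it checks (80, 83, 'Degree'), it cannot find the word BSc on its own. Likewise, it fails again for (84, 103, 'Degree').

How can I handle these cases? Any help would be appreciated.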

EDUCATION: · Master of Computer Applications (MCA) from NV, *********, *****. · BSc(Bachelor of science) from NV, *********, *****

{'entities': [(13, 44, 'Degree'), (46, 49, 'Degree'), (80, 83, 'Degree'), (84, 103, 'Degree')]}

Answer 1

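Normally, you pass your data to biluo_tags_from_offsets(doc, entities), where entities looks like [(14, 44, 'ORG'), (51, 54, 'ORG')]. You can edit this argument as needed (you can start from doc.ents and build on it). You can add, remove, or combine any entities in this list, as in the following example: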

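import spacy
# spaCy v2 import path; in spaCy v3 the equivalent function is spacy.training.offsets_to_biluo_tags
from spacy.gold import biluo_tags_from_offsets
nlp = spacy.load("en_core_web_md")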

text = "I have a BSc (Bachelors of Computer Sciences) from NYU"
doc = nlp(text)
print("Entities before adding new entity:", doc.ents)

# Collect the model's predicted entities as (start_char, end_char, label) tuples
entities = []
for ent in doc.ents:
    entities.append((ent.start_char, ent.end_char, ent.label_))
print("BILUO before adding new entity:", biluo_tags_from_offsets(doc, entities))

entities.append((9, 12, 'ORG'))  # add a desired entity: "BSc" spans characters 9-12

print("BILUO after adding new entity:", biluo_tags_from_offsets(doc, entities))

Entities before adding new entity: (Bachelors of Computer Sciences, NYU)
BILUO before adding new entity: ['O', 'O', 'O', 'O', 'O', 'B-ORG', 'I-ORG', 'I-ORG', 'L-ORG', 'O', 'O', 'U-ORG']
BILUO after adding new entity: ['O', 'O', 'O', 'U-ORG', 'O', 'B-ORG', 'I-ORG', 'I-ORG', 'L-ORG', 'O', 'O', 'U-ORG']
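
For the "combine" case, here is a minimal sketch that works on the offset list alone. It assumes that same-label spans separated by at most one character (such as the "(" between BSc and Bachelor in the question) should become a single entity. The helper name merge_adjacent_spans and its max_gap parameter are hypothetical and not part of spaCy.

```python
def merge_adjacent_spans(entities, max_gap=1):
    """Merge same-label (start, end, label) spans at most max_gap characters apart.

    Hypothetical helper for illustration; not part of spaCy.
    """
    merged = []
    for start, end, label in sorted(entities):
        if merged:
            prev_start, prev_end, prev_label = merged[-1]
            if label == prev_label and start - prev_end <= max_gap:
                # Extend the previous span instead of starting a new one
                merged[-1] = (prev_start, max(prev_end, end), label)
                continue
        merged.append((start, end, label))
    return merged


# Entity offsets taken from the question
entities = [(13, 44, 'Degree'), (46, 49, 'Degree'), (80, 83, 'Degree'), (84, 103, 'Degree')]
merged = merge_adjacent_spans(entities)
print(merged)  # [(13, 44, 'Degree'), (46, 49, 'Degree'), (80, 103, 'Degree')]
```

The merged list can then be passed to biluo_tags_from_offsets(doc, merged). Whether the span (80, 103) lines up with token boundaries still depends on how the tokenizer splits the text, so it is worth checking the returned tags.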

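If you want the entity-merging process to be rule-based, you can try the EntityRuler, using the following simplified example (taken from the spaCy documentation):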

from spacy.lang.en import English
from spacy.pipeline import EntityRuler

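nlp = English()
# spaCy v2 API: in spaCy v3, create and register the component with ruler = nlp.add_pipe("entity_ruler")
ruler = EntityRuler(nlp)
# A string pattern matches exact text; a list of dicts matches token by token
patterns = [{"label": "ORG", "pattern": "Apple"},
            {"label": "GPE", "pattern": [{"LOWER": "san"}, {"LOWER": "francisco"}]}]
ruler.add_patterns(patterns)
nlp.add_pipe(ruler)  # v2 only: registers the ruler instance in the pipeline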

doc = nlp("Apple is opening its first big office in San Francisco.")
print([(ent.text, ent.label_) for ent in doc.ents])

Then pass the redefined (in your case, merged) entity list to biluo_tags_from_offsets again, as in the first code snippet.
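
biluo_tags_from_offsets returns BILUO tags, while the question asks for BIO. A minimal sketch of that last step, assuming the usual mapping (L- becomes I-, U- becomes B-, everything else is left unchanged). The name biluo_to_bio is a hypothetical helper, not a spaCy function.

```python
def biluo_to_bio(tags):
    """Convert a list of BILUO tags to BIO tags.

    Hypothetical helper for illustration; not part of spaCy.
    """
    bio = []
    for tag in tags:
        if tag.startswith("L-"):
            # Last token of a multi-token entity is just "inside" in BIO
            bio.append("I-" + tag[2:])
        elif tag.startswith("U-"):
            # Single-token entity becomes a "begin" tag in BIO
            bio.append("B-" + tag[2:])
        else:
            # B-, I-, O (and any placeholder such as "-") pass through unchanged
            bio.append(tag)
    return bio


# BILUO tags from the first snippet's output
biluo = ['O', 'O', 'O', 'U-ORG', 'O', 'B-ORG', 'I-ORG', 'I-ORG', 'L-ORG', 'O', 'O', 'U-ORG']
print(biluo_to_bio(biluo))
# ['O', 'O', 'O', 'B-ORG', 'O', 'B-ORG', 'I-ORG', 'I-ORG', 'I-ORG', 'O', 'O', 'B-ORG']
```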

Solution 1
2 2020-09-23 11:46:16
