Converting spaCy training data format to spaCy CLI format (for blank NER)

Question

This is the classic training format.

TRAIN_DATA = [
    ("Who is Shaka Khan?", {"entities": [(7, 17, "PERSON")]}),
    ("I like London and Berlin.", {"entities": [(7, 13, "LOC"), (18, 24, "LOC")]}),
]

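I used to train in code, but as I understand it, training with the CLI train method works better. However, my data is in the format above.

I have found code snippets for this kind of conversion, but every one of them calls spacy.load('en') instead of starting from a blank model, which makes me wonder: are they training an existing model rather than a blank one?

This chunk seems pretty easy: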

import spacy
from spacy.gold import docs_to_json
import srsly

nlp = spacy.load('en', disable=["ner"]) # as you see it's loading 'en' which I don't have
# TRAIN_DATA is the list shown above

docs = []
for text, annot in TRAIN_DATA:
    doc = nlp(text)
    doc.ents = [doc.char_span(start_idx, end_idx, label=label) for start_idx, end_idx, label in annot["entities"]]
    docs.append(doc)

srsly.write_json("ent_train_data.json", [docs_to_json(docs)])

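Running this code throws: Can't find model 'en'. It doesn't seem to be a shortcut link, a Python package or a valid path to a data directory.

I am confused about how to use this with spacy train on a blank model. Do I just use spacy.blank('en')? And if so, what about the disable=["ner"] flag?

Edit:

If I try spacy.blank('en'), I get: Can't import language target from spacy.lang: No module named 'spacy.lang.en'

Edit 2: I tried loading en_core_web_sm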

nlp = spacy.load('en_core_web_sm')

docs = []
for text, annot in TRAIN_DATA:
    doc = nlp(text)
    doc.ents = [doc.char_span(start_idx, end_idx, label=label) for start_idx, end_idx, label in annot["entities"]]
    docs.append(doc)

srsly.write_json("ent_train_data.json", [docs_to_json(docs)])

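TypeError: object of type 'NoneType' has no len()

Elton (from print(text[start:end]))

Goal! FK Qarabag 1, Partizani Tirana 0. Filip Ozobic - FK Qarabag - shot from the centre of the box to the centre of the goal. Assist - Elton (from print(text))

None (from the doc.ents = ... line)

TypeError: object of type 'NoneType' has no len()

Edit 3: following a comment from Ines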

nlp = spacy.load('en_core_web_sm')

docs = []
for text, annot in TRAIN_DATA:

    doc = nlp(text)

    tags = biluo_tags_from_offsets(doc, annot['entities'])
    docs.append(doc)

srsly.write_json(train_name + "_spacy_format.json", [docs_to_json(docs)])

This created the JSON, but I don't see any of my tagged entities in the generated JSON.

Answer 1

Your Edit 3 is close, but you are missing the step that adds the entities to the doc. This should work:

import spacy
import srsly
from spacy.gold import docs_to_json, biluo_tags_from_offsets, spans_from_biluo_tags

TRAIN_DATA = [
    ("Who is Shaka Khan?", {"entities": [(7, 17, "PERSON")]}),
    ("I like London and Berlin.", {"entities": [(7, 13, "LOC"), (18, 24, "LOC")]}),
]

nlp = spacy.load('en_core_web_sm')
docs = []
for text, annot in TRAIN_DATA:
    doc = nlp(text)
    tags = biluo_tags_from_offsets(doc, annot['entities'])
    entities = spans_from_biluo_tags(doc, tags)
    doc.ents = entities
    docs.append(doc)

srsly.write_json("spacy_format.json", [docs_to_json(docs)])

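It would be good to add a built-in function for this conversion, since it is common to want to move from the example scripts (which are only meant as simple demos) to the train CLI.

Edit:

You can also skip the indirection through the built-in BILUO converters and use what you had above: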

    doc.ents = [doc.char_span(start_idx, end_idx, label=label) for start_idx, end_idx, label in annot["entities"]]
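
One caveat with the char_span approach: doc.char_span returns None when the character offsets do not line up with token boundaries, and a None in the list assigned to doc.ents is a likely cause of the TypeError shown in Edit 2. Below is a minimal sketch of a guard against that. It assumes only tokenization is needed for the conversion, so it uses spacy.blank("en") rather than a pretrained model; the helper name doc_with_entities and the third, deliberately misaligned example are hypothetical additions for illustration.

```python
import spacy

TRAIN_DATA = [
    ("Who is Shaka Khan?", {"entities": [(7, 17, "PERSON")]}),
    ("I like London and Berlin.", {"entities": [(7, 13, "LOC"), (18, 24, "LOC")]}),
    # Hypothetical bad annotation: offsets 0-5 cut the token "London" short.
    ("London calling", {"entities": [(0, 5, "LOC")]}),
]

# A blank pipeline contains only the tokenizer, so no model download is needed.
nlp = spacy.blank("en")


def doc_with_entities(nlp, text, entities):
    """Tokenize text, attach the aligned entities, and return (doc, skipped)."""
    doc = nlp.make_doc(text)
    spans, skipped = [], []
    for start, end, label in entities:
        span = doc.char_span(start, end, label=label)
        if span is None:
            # Offsets do not match token boundaries; record them instead of failing.
            skipped.append((start, end, label))
        else:
            spans.append(span)
    doc.ents = spans
    return doc, skipped


docs = []
for text, annot in TRAIN_DATA:
    doc, skipped = doc_with_entities(nlp, text, annot["entities"])
    for start, end, label in skipped:
        print(f"Skipped misaligned span {text[start:end]!r} ({label}) in {text!r}")
    docs.append(doc)
```

The skipped offsets are worth inspecting rather than silently dropping, since they usually point at annotation mistakes such as stray whitespace or a partial word.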

Answer 2

import spacy
import srsly
from spacy.training import docs_to_json, offsets_to_biluo_tags, biluo_tags_to_spans

TRAIN_DATA = [
    ("Who is Shaka Khan?", {"entities": [(7, 17, "PERSON")]}),
    ("I like London and Berlin.", {"entities": [(7, 13, "LOC"), (18, 24, "LOC")]}),
]

nlp = spacy.load('en_core_web_lg')
docs = []
for text, annot in TRAIN_DATA:
    doc = nlp(text)
    tags = offsets_to_biluo_tags(doc, annot['entities'])
    entities = biluo_tags_to_spans(doc, tags)
    doc.ents = entities
    docs.append(doc)

srsly.write_json("spacy_format.json", [docs_to_json(docs)])

The code above works as of spaCy v3.1. Some of the relevant methods from spacy.gold have been renamed and moved to spacy.training.
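
Note that in spaCy v3 the train CLI reads binary .spacy files (serialized DocBin objects) rather than the v2 JSON format, so the JSON written above would still need to go through spacy convert. As a possible alternative, here is a minimal sketch that writes a .spacy file directly from a blank pipeline. The function name write_docbin and the output file name are assumptions for illustration, and misaligned spans are simply dropped here.

```python
import tempfile
from pathlib import Path

import spacy
from spacy.tokens import DocBin

TRAIN_DATA = [
    ("Who is Shaka Khan?", {"entities": [(7, 17, "PERSON")]}),
    ("I like London and Berlin.", {"entities": [(7, 13, "LOC"), (18, 24, "LOC")]}),
]


def write_docbin(train_data, output_path, lang="en"):
    """Serialize (text, annotations) pairs to a .spacy file; return the doc count."""
    nlp = spacy.blank(lang)  # tokenizer only, no pretrained components
    doc_bin = DocBin()
    for text, annot in train_data:
        doc = nlp.make_doc(text)
        spans = [
            doc.char_span(start, end, label=label)
            for start, end, label in annot["entities"]
        ]
        # char_span returns None for offsets that do not match token boundaries.
        doc.ents = [span for span in spans if span is not None]
        doc_bin.add(doc)
    doc_bin.to_disk(output_path)
    return len(doc_bin)


# Demo: write to a temporary directory so the example leaves no files behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = Path(tmp_dir) / "train.spacy"
    n_docs = write_docbin(TRAIN_DATA, out_path)
    print(f"Wrote {n_docs} docs to {out_path.name}")
```

In a real project you would write train.spacy and a separate dev.spacy to a permanent location, then point the train command at them, for example with the --paths.train and --paths.dev overrides of python -m spacy train.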

Answer 1: 6 votes, accepted, 2019-12-06 08:20:54

Answer 2: 2 votes, 2021-07-14 22:02:13