Convert spaCy training data format to spaCy CLI format (for blank NER)

Question

This is the classic training format.

TRAIN_DATA = [
    ("Who is Shaka Khan?", {"entities": [(7, 17, "PERSON")]}),
    ("I like London and Berlin.", {"entities": [(7, 13, "LOC"), (18, 24, "LOC")]}),
]

I used to train from code, but as I understand it, training works better with the CLI train method. However, my data is in the format above.

I have found code snippets for this type of conversion, but every one of them calls spacy.load('en') rather than starting from a blank model, which made me wonder: are they training an existing model rather than a blank one?

This chunk seems pretty simple:

import spacy
from spacy.gold import docs_to_json
import srsly

nlp = spacy.load('en', disable=["ner"]) # as you see it's loading 'en' which I don't have
# TRAIN_DATA is the list defined above

docs = []
for text, annot in TRAIN_DATA:
    doc = nlp(text)
    doc.ents = [doc.char_span(start_idx, end_idx, label=label) for start_idx, end_idx, label in annot["entities"]]
    docs.append(doc)

srsly.write_json("ent_train_data.json", [docs_to_json(docs)])

Running this code throws: Can't find model 'en'. It doesn't seem to be a shortcut link, a Python package or a valid path to a data directory.

I am quite confused about how to use it with spacy train on a blank model. Should I just use spacy.blank('en')? But then what about the disable=["ner"] flag?

Edit:

If I try spacy.blank('en') instead, I receive Can't import language goal from spacy.lang: No module named 'spacy.lang.en'

Edit 2: I have tried loading en_core_web_sm

nlp = spacy.load('en_core_web_sm')

docs = []
for text, annot in TRAIN_DATA:
    doc = nlp(text)
    doc.ents = [doc.char_span(start_idx, end_idx, label=label) for start_idx, end_idx, label in annot["entities"]]
    docs.append(doc)

srsly.write_json("ent_train_data.json", [docs_to_json(docs)])

TypeError: object of type 'NoneType' has no len()

Ailton - print(text[start:end])

Goal! FK Qarabag 1, Partizani Tirana 0. Filip Ozobic - FK Qarabag - shot with the head from the centre of the box to the centre of the goal. Assist - Ailton - print(text)

None - doc.ents =... line

TypeError: object of type 'NoneType' has no len()

Edit 3: From Ines' comment

nlp = spacy.load('en_core_web_sm')

docs = []
for text, annot in TRAIN_DATA:

    doc = nlp(text)

    tags = biluo_tags_from_offsets(doc, annot['entities'])
    docs.append(doc)

srsly.write_json(train_name + "_spacy_format.json", [docs_to_json(docs)])

This created the JSON file, but I don't see any of my tagged entities in it.

Answer 1

Edit 3 is close, but you're missing the step where you add the entities to the document. This should work:

import spacy
import srsly
from spacy.gold import docs_to_json, biluo_tags_from_offsets, spans_from_biluo_tags

TRAIN_DATA = [
    ("Who is Shaka Khan?", {"entities": [(7, 17, "PERSON")]}),
    ("I like London and Berlin.", {"entities": [(7, 13, "LOC"), (18, 24, "LOC")]}),
]

nlp = spacy.load('en_core_web_sm')
docs = []
for text, annot in TRAIN_DATA:
    doc = nlp(text)
    tags = biluo_tags_from_offsets(doc, annot['entities'])
    entities = spans_from_biluo_tags(doc, tags)
    doc.ents = entities
    docs.append(doc)

srsly.write_json("spacy_format.json", [docs_to_json(docs)])

It would be good to add a built-in function for this conversion, since it's common to want to move from the example scripts (which are just meant to be simple demos) to the train CLI.

Edit:

You can also skip the somewhat indirect use of the built-in BILUO converters and use what you had above:

    doc.ents = [doc.char_span(start_idx, end_idx, label=label) for start_idx, end_idx, label in annot["entities"]]
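One caveat with the char_span shortcut: doc.char_span returns None when the character offsets do not line up with token boundaries, and assigning a list that contains None to doc.ents is a likely cause of the TypeError shown in Edit 2. The following is a minimal sketch of filtering those spans out, assuming spacy.blank("en") works in your installation. A blank pipeline needs no downloaded model and has no ner component, so the disable=["ner"] flag should not be needed. The helper text_to_doc and the third, deliberately misaligned example are hypothetical and only for illustration.

```python
import spacy

TRAIN_DATA = [
    ("Who is Shaka Khan?", {"entities": [(7, 17, "PERSON")]}),
    ("I like London and Berlin.", {"entities": [(7, 13, "LOC"), (18, 24, "LOC")]}),
    # Hypothetical bad annotation: offset 8 starts inside the token "Shaka".
    ("Who is Shaka Khan?", {"entities": [(8, 17, "PERSON")]}),
]

# Blank English pipeline: tokenizer only, no trained components.
nlp = spacy.blank("en")


def text_to_doc(text, annot):
    """Return (doc, skipped), where skipped lists offsets that did not align."""
    doc = nlp(text)
    spans, skipped = [], []
    for start, end, label in annot["entities"]:
        span = doc.char_span(start, end, label=label)
        if span is None:
            # Offsets fall inside a token, so no valid span exists.
            skipped.append((start, end, label))
        else:
            spans.append(span)
    doc.ents = spans
    return doc, skipped


results = [text_to_doc(text, annot) for text, annot in TRAIN_DATA]
for doc, skipped in results:
    print([(ent.text, ent.label_) for ent in doc.ents], "skipped:", skipped)
```

Printing the skipped offsets next to text[start:end] is usually enough to spot annotations that include stray whitespace or cut a word in half.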

Answer 2

import spacy
import srsly
from spacy.training import docs_to_json, offsets_to_biluo_tags, biluo_tags_to_spans

TRAIN_DATA = [
    ("Who is Shaka Khan?", {"entities": [(7, 17, "PERSON")]}),
    ("I like London and Berlin.", {"entities": [(7, 13, "LOC"), (18, 24, "LOC")]}),
]

nlp = spacy.load('en_core_web_lg')
docs = []
for text, annot in TRAIN_DATA:
    doc = nlp(text)
    tags = offsets_to_biluo_tags(doc, annot['entities'])
    entities = biluo_tags_to_spans(doc, tags)
    doc.ents = entities
    docs.append(doc)

srsly.write_json("spacy_format.json", [docs_to_json(docs)])

As of spaCy v3.1, the above code works. Some relevant methods from spacy.gold have been renamed and moved to spacy.training.
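Note that in spaCy v3 the train command reads the binary .spacy format rather than this JSON, so the JSON file would still need to go through python -m spacy convert. As an alternative, here is a hedged sketch that writes the binary file directly with DocBin from a blank pipeline. The function name build_docbin and the output file name train.spacy are assumptions made for illustration, not part of either answer.

```python
import spacy
from spacy.tokens import DocBin

TRAIN_DATA = [
    ("Who is Shaka Khan?", {"entities": [(7, 17, "PERSON")]}),
    ("I like London and Berlin.", {"entities": [(7, 13, "LOC"), (18, 24, "LOC")]}),
]

# Blank pipeline: only the tokenizer is used, so no model download is needed.
nlp = spacy.blank("en")


def build_docbin(train_data):
    """Collect annotated Doc objects into a DocBin (spaCy v3 binary format)."""
    db = DocBin()
    for text, annot in train_data:
        doc = nlp.make_doc(text)  # tokenize only
        spans = [
            doc.char_span(start, end, label=label)
            for start, end, label in annot["entities"]
        ]
        # Drop spans whose offsets do not align with token boundaries.
        doc.ents = [span for span in spans if span is not None]
        db.add(doc)
    return db


doc_bin = build_docbin(TRAIN_DATA)
doc_bin.to_disk("train.spacy")  # assumed output path
```

The resulting file can then be passed to the CLI, for example via --paths.train ./train.spacy together with a config file and a matching dev set.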

Answer 1: 6 votes, accepted, 2019-12-06 08:20:54

Answer 2: 2 votes, 2021-07-14 22:02:13