
How to export "Document with entities from spaCy" for use in doccano

I want to train my model with doccano or another "open source text annotation tool" and continuously improve my model.

My understanding is that, for this, I can import annotated data into doccano in the format described here: doccano import
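For reference, a single import line for my example would look roughly like this, as far as I can tell from the linked docs (not an official sample):

{"text": "Test text that should be annotated for Michael Schumacher.", "labels": [[39, 57, "PERSON"]]}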

So as a first step I have loaded a model and created a doc:

text = "Test text that should be annotated for Michael Schumacher" 
nlp = spacy.load('en_core_web_sm')
doc = nlp(text)

I know I can export the JSONL format (with text and annotated labels) from doccano and train a model with it, but I want to know how to export that data from a spaCy doc in Python so that I can import it into doccano.

Thanks in advance.

I had a similar task recently; here is how I did it:

import spacy
nlp = spacy.load('en_core_web_sm')

def text_to_doccano(text):
    """
    :text (str): source text
    Returns (list (dict)): doccano format json
    """
    djson = list()
    doc = nlp(text)
    for sent in doc.sents:
        labels = list()
        for e in sent.ents:
            labels.append([e.start_char, e.end_char, e.label_])
        djson.append({'text': sent.text, "labels": labels})
    return djson

Based on your example ...

text = "Test text that should be annotated for Michael Schumacher."
djson = text_to_doccano(text)
print(djson)

... this would print out:

[{'text': 'Test text that should be annotated for Michael Schumacher.', 'labels': [[39, 57, 'PERSON']]}]

On a related note: when you save the results to a file, the standard json.dump approach for saving JSON won't work, as it would write everything as a single list of entries separated by commas. AFAIK, doccano expects one entry per line and no trailing commas. The following snippet works like a charm:

import json

with open(filepath, 'w') as f:
    f.write("\n".join([json.dumps(e) for e in djson]))

/Cheers

spaCy doesn't support this exact format out of the box, but you should be able to write a custom function fairly easily. Take a look at spacy.gold.docs_to_json(), which shows a similar conversion to JSON.
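For instance, a minimal sketch of that helper (assuming spaCy 2.x, where spacy.gold is still available; the module was reorganized in spaCy 3):

import spacy
from spacy.gold import docs_to_json

nlp = spacy.load('en_core_web_sm')
doc = nlp("Test text that should be annotated for Michael Schumacher.")

# docs_to_json returns spaCy's own JSON training structure, not doccano's JSONL,
# but it is a handy reference when writing a custom converter.
print(docs_to_json([doc]))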

Doccano and/or spaCy seem to have changed things and there are now some flaws in the accepted answer. This revised version should be more correct with spaCy 3.1 and Doccano as of 8/1/2021...

def text_to_doccano(text):
    """
    :text (str): source text
    Returns (list (dict)): doccano format json
    """
    djson = list()
    doc = nlp(text)
    for sent in doc.sents:
        labels = list()
        for e in sent.ents:
            labels.append([e.start_char - sent.start_char, e.end_char - sent.start_char, e.label_])
        djson.append({'text': sent.text, "label": labels})
    return djson

The differences:

  1. labels becomes the singular label in the JSON (?!?)
  2. e.start_char and e.end_char are actually (now?) the start and end within the document, not within the sentence, so you have to offset them by the position of the sentence within the document (see the quick check after this list).
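A quick sanity check of those offsets (a sketch that assumes the revised text_to_doccano above and a loaded nlp pipeline; the sample sentence is made up):

# Print the exact substring each label points at; it should equal the entity text.
sample = "This is a first sentence. Michael Schumacher raced for Ferrari."
for entry in text_to_doccano(sample):
    for start, end, label in entry["label"]:
        print(repr(entry["text"][start:end]), "->", label)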

I have used the Doccano annotation tool to generate annotations. I exported the .jsonl file from Doccano and converted it to the .spacy training format using the customized code below.

Steps to follow:

Step 1: Use the doccano tool to annotate the data.

Step 2: Export the annotation file from Doccano, which is in .jsonl format.

Step 3: Pass that .jsonl file to the fillterDoccanoData("./root.jsonl") function in the code below. In my case I have root.jsonl; you can use your own file.

Step 4: Use the following code to convert your .jsonl file to a .spacy training file.

Step 5: Finally, you can find train.spacy in your working directory.

Thanks

import spacy
from spacy.tokens import DocBin
from tqdm import tqdm
import logging
import json

# filter data to convert into spaCy training format
def fillterDoccanoData(doccano_JSONL_FilePath):
    try:
        training_data = []
        lines=[]
        with open(doccano_JSONL_FilePath, 'r') as f:
            lines = f.readlines()

        for line in lines:
            data = json.loads(line)
            text = data['data']
            entities = data['label']
            if len(entities)>0:
                training_data.append((text, {"entities" : entities}))
        return training_data
    except Exception as e:
        logging.exception("Unable to process " + doccano_JSONL_FilePath + "\n" + "error = " + str(e))
        return None

#read Doccano Annotation file .jsonl
TRAIN_DATA = fillterDoccanoData("./root.jsonl")  # root.jsonl is the annotation file name

nlp = spacy.blank("en") # load a new spacy model
db = DocBin() # create a DocBin object
for text, annot in tqdm(TRAIN_DATA): # data in previous format
    doc = nlp.make_doc(text) # create doc object from text
    ents = []
    for start, end, label in annot["entities"]: # add character indexes
        span = doc.char_span(start, end, label=label, alignment_mode="contract")
        if span is None:
            print("Skipping entity")
        else:
            ents.append(span)
    try:
        doc.ents = ents # label the text with the ents
        db.add(doc)
    except:
        print(text, annot)
db.to_disk("./train.spacy") # save the docbin object
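To sanity-check the result, the following sketch (assuming the train.spacy produced above) loads the DocBin back and prints the stored entities; the file can then be referenced from a spaCy 3 training config, e.g. via the --paths.train override of python -m spacy train:

# Load train.spacy back and inspect the stored annotations.
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")
db = DocBin().from_disk("./train.spacy")
for doc in db.get_docs(nlp.vocab):
    print(doc.text[:60], [(ent.text, ent.label_) for ent in doc.ents])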
