Converting NER training data to spaCy training data format

I am creating an Indonesian NER model using spaCy. I'm using training data from https://raw.githubusercontent.com/yohanesgultom/nlp-experiments/master/data/ner/training_data.txt

The training data uses this tag format:

Sementara itu Pengamat Pasar Modal <ENAMEX TYPE="PERSON">Dandossi Matram</ENAMEX> mengatakan,

I want to convert this training data to the spaCy format, which is:

[('Sementara itu Pengamat Pasar Modal Dandossi Matram mengatakan,', {"entities": [(35, 50, 'PERSON')]})]

I'm still new to Python libraries. Any idea how to convert the training data, or which library to use?

Thank you.

For simple XML-style annotations you can use BeautifulSoup. Here's an example with slightly simpler markup:

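from bs4 import BeautifulSoup, Tag

raw = "I went to <PLACE>Tokyo 3</PLACE> last year."
soup = BeautifulSoup(raw, features="html.parser")

out = ""    # plain text with the markup stripped
tags = []   # (tag name, start, end) character spans into `out`
idx = 0     # current length of `out`
for el in soup:
    # Check the type explicitly: in some BeautifulSoup versions plain
    # strings also expose a .text attribute, so hasattr() is not reliable.
    if isinstance(el, Tag):
        # it's a tag, save its span
        text = el.text
        tags.append((el.name, idx, idx + len(text)))
    else:
        # plain string between tags
        text = str(el)

    out += text
    idx += len(text)

print(out)
for name, start, end in tags:
    # html.parser lowercases tag names, so the name printed here is "place"
    print(name, out[start:end], sep="\t")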

Once you have character spans like the ones this example code produces, getting the data into spaCy format is straightforward.
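As a rough illustration of that last step, here is a minimal sketch that turns one ENAMEX-tagged line from the question into a (text, {"entities": [...]}) tuple. It swaps BeautifulSoup for the standard-library re module so it runs without extra installs. It assumes the tags are flat (never nested) and always have the form <ENAMEX TYPE="LABEL">...</ENAMEX>. The function name enamex_to_spacy is made up for this example and is not part of spaCy.

```python
import re

# Assumption: tags are flat (not nested) and always look like
# <ENAMEX TYPE="LABEL">text</ENAMEX>, as in the sample line from the question.
ENAMEX = re.compile(r'<ENAMEX\s+TYPE="([^"]+)">(.*?)</ENAMEX>', re.DOTALL)


def enamex_to_spacy(line):
    """Hypothetical helper: return (plain_text, {"entities": [(start, end, label)]})."""
    parts = []      # pieces of the text with the markup removed
    entities = []   # (start, end, label) character offsets into the plain text
    length = 0      # length of the plain text built so far
    last = 0        # position in `line` just after the previous match

    for match in ENAMEX.finditer(line):
        # Copy the untagged text that precedes this entity.
        before = line[last:match.start()]
        parts.append(before)
        length += len(before)

        # Record the entity span relative to the markup-free text.
        label, text = match.group(1), match.group(2)
        entities.append((length, length + len(text), label))
        parts.append(text)
        length += len(text)
        last = match.end()

    parts.append(line[last:])  # trailing text after the last entity
    return "".join(parts), {"entities": entities}


raw = ('Sementara itu Pengamat Pasar Modal '
       '<ENAMEX TYPE="PERSON">Dandossi Matram</ENAMEX> mengatakan,')
train_data = [enamex_to_spacy(raw)]
print(train_data)
```

For the sample line this yields the span (35, 50, 'PERSON'). The end offset is exclusive, so text[35:50] is exactly "Dandossi Matram". Applying the function to each line of the training file would give a list in the same shape.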
