
How to convert spaCy NER dataset format to Flair format?

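I have labeled a dataset with dataturks to train spaCy NER, and everything works fine. However, I just realized that Flair uses a different format, and I am wondering whether there is a way to convert my "spaCy's NER" JSON dataset into the Flair format: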

George N B-PER
Washington N I-PER
went V O
to P O
Washington N B-LOC

Whereas spaCy's format is as follows:

[("George Washington went to Washington",
{'entities': [(0, 6,'PER'),(7, 17,'PER'),(26, 36,'LOC')]})]

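Flair reads a column format with one token per line and an empty line between sentences, so you need a tag for every token. biluo_tags_from_offsets produces these tags in the BILUO scheme (the import below is for spaCy v2; in spaCy v3 the function is spacy.training.offsets_to_biluo_tags):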

import spacy
from spacy.gold import biluo_tags_from_offsets
nlp = spacy.load("en_core_web_md")

ents = [("George Washington went to Washington",{'entities': [(0, 6,'PER'),(7, 17,'PER'),(26, 36,'LOC')]}),
         ("Uber blew through $1 million a week", {'entities':[(0, 4, 'ORG')]}),
       ]

with open("flair_ner.txt","w") as f:
    for sent,tags in ents:
        doc = nlp(sent)
        biluo = biluo_tags_from_offsets(doc,tags['entities'])
        for word,tag in zip(doc, biluo):
            f.write(f"{word} {tag}\n")
        f.write("\n")

Output:

George U-PER
Washington U-PER
went O
to O
Washington U-LOC

Uber U-ORG
blew O
through O
$ O
1 O
million O
a O
week O
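
To see why the first sentence comes out as two U-PER tags rather than B-PER followed by a continuation tag, here is a minimal sketch of the offset-to-BILUO idea in plain Python. It is not spaCy's implementation: it assumes whitespace tokenization, silently skips spans that do not line up with token boundaries, and the helper names whitespace_tokens and offsets_to_biluo are made up for illustration.

```python
def whitespace_tokens(text):
    """Yield (token, start, end) for each whitespace-separated token."""
    pos = 0
    for tok in text.split():
        start = text.index(tok, pos)
        end = start + len(tok)
        yield tok, start, end
        pos = end


def offsets_to_biluo(text, entities):
    """Return [(token, tag)] for character-offset entities (start, end, label)."""
    tokens = list(whitespace_tokens(text))
    tags = ["O"] * len(tokens)
    for ent_start, ent_end, label in entities:
        # Indices of the tokens that lie fully inside this entity span.
        inside = [i for i, (_, s, e) in enumerate(tokens)
                  if s >= ent_start and e <= ent_end]
        if not inside:
            continue  # simplification: ignore spans that match no token
        if len(inside) == 1:
            tags[inside[0]] = f"U-{label}"      # single-token entity
        else:
            tags[inside[0]] = f"B-{label}"      # first token
            for i in inside[1:-1]:
                tags[i] = f"I-{label}"          # middle tokens
            tags[inside[-1]] = f"L-{label}"     # last token
    return [(tok, tag) for (tok, _, _), tag in zip(tokens, tags)]


sentence = "George Washington went to Washington"
# The annotation from the question: two separate one-token PER spans.
separate = offsets_to_biluo(sentence, [(0, 6, "PER"), (7, 17, "PER"), (26, 36, "LOC")])
# One span covering both name tokens instead.
merged = offsets_to_biluo(sentence, [(0, 17, "PER"), (26, 36, "LOC")])
```

With the question's offsets, each name token is its own entity, so both get a U- tag. If "George Washington" is meant to be one person, annotating it as the single span (0, 17, 'PER') gives B-PER and L-PER instead.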

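Note that training on NER tags alone seems to be sufficient. If you also want to add POS tags, you need to create a mapping from the Universal POS tags to Flair's simplified scheme. For example: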

tag_mapping = {'PROPN':'N','VERB':'V','ADP':'P','NOUN':'N'} # create your own
with open("flair_ner.txt","w") as f:
    for sent,tags in ents:
        doc = nlp(sent)
        biluo = biluo_tags_from_offsets(doc,tags['entities'])
        try:
            for word,tag in zip(doc, biluo):
                # Raises KeyError at the first POS tag missing from tag_mapping,
                # which leaves the current sentence only partially written.
                f.write(f"{word} {tag_mapping[word.pos_]} {tag}\n")
                # Alternative that writes a placeholder instead of stopping:
                # f.write(f"{word} {tag_mapping.get(word.pos_,'None')} {tag}\n")
        except KeyError:
            print(f"'{word.pos_}' tag is not defined in tag_mapping")
        f.write("\n")

Output:

'SYM' tag is not defined in tag_mapping
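
The error above comes from an incomplete mapping. One way to keep every sentence intact while still learning which tags are missing is to fall back to a placeholder and collect the unmapped tags. The sketch below is plain Python with no spaCy dependency: the (text, pos, ner) tuples are hand-written stand-ins for spaCy tokens, and format_rows and the "X" placeholder are hypothetical choices, not part of Flair or spaCy.

```python
TAG_MAPPING = {"PROPN": "N", "VERB": "V", "ADP": "P", "NOUN": "N"}


def format_rows(tokens, mapping=TAG_MAPPING, default="X"):
    """Build 'text pos ner' lines and report POS tags missing from the mapping.

    tokens: iterable of (text, universal_pos, ner_tag) tuples.
    Returns (lines, unmapped), where unmapped is a set of POS tags.
    """
    lines, unmapped = [], set()
    for text, upos, ner in tokens:
        if upos not in mapping:
            unmapped.add(upos)
        # Unmapped tags get the placeholder, so no token is dropped.
        lines.append(f"{text} {mapping.get(upos, default)} {ner}")
    return lines, unmapped


# Hand-written example data (assumed POS tags, not actual model output).
tokens = [
    ("Uber", "PROPN", "U-ORG"),
    ("blew", "VERB", "O"),
    ("through", "ADP", "O"),
    ("$", "SYM", "O"),
    ("1", "NUM", "O"),
]
lines, unmapped = format_rows(tokens)
```

Printing unmapped once at the end shows every tag that still needs an entry in the mapping, instead of stopping at the first one.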

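The main data format used in spaCy v3.0 is a binary format with the extension .spacy. The JSON format is deprecated. To convert a train.spacy file with BILUO annotations into the Flair format, I created a Corpus.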

import spacy
from spacy.training import Corpus

nlp = spacy.load("de_core_news_sm")
corpus = Corpus("route/to/train.spacy")

data = corpus(nlp)

# Flair supports BIO and BIOES, see https://github.com/flairNLP/flair/issues/875
def rename_biluo_to_bioes(old_tag):
    # BILUO and BIOES differ only in two prefixes: L (last) becomes E (end)
    # and U (unit) becomes S (single). B, I and O stay as they are.
    if old_tag.startswith("L-"):
        return "E" + old_tag[1:]
    if old_tag.startswith("U-"):
        return "S" + old_tag[1:]
    return old_tag


def generate_corpus():
    corpus = []
    for example in data:
        # Re-run the full pipeline on the raw text to get POS tags.
        doc = nlp(example.text)
        tags = example.get_aligned_ner()
        # None marks a token without a usable NER tag; skip such examples.
        if None in tags:
            continue
        new_tags = [rename_biluo_to_bioes(tag) for tag in tags]
        for token, tag in zip(doc, new_tags):
            corpus.append(f"{token.text} {token.pos_} {tag}\n")
        # An empty line separates sentences.
        corpus.append('\n')
    return corpus

def write_file(filepath):
    with open(filepath, 'w', encoding='utf-8') as f:
        corpus = generate_corpus()
        f.writelines(corpus)
        
def main():
    write_file('./data/train.txt')

if __name__ == '__main__':
    main()   

I hope it works. It has been a while, though.
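
Since the linked issue mentions BIO as the other scheme Flair accepts, a file that was already written with BILUO tags can also be converted afterwards. The following is a rough sketch under the assumption that the tag is the last whitespace-separated column of every non-empty line; biluo_to_bio and convert_column_text are illustrative names, not functions from Flair or spaCy.

```python
def biluo_to_bio(tag):
    """Map one BILUO tag to BIO: U- becomes B-, L- becomes I-."""
    if tag.startswith("U-"):
        return "B-" + tag[2:]
    if tag.startswith("L-"):
        return "I-" + tag[2:]
    return tag  # B-, I- and O are identical in both schemes


def convert_column_text(text):
    """Rewrite the last column of each token line; keep blank lines as-is."""
    out = []
    for line in text.splitlines():
        if not line.strip():
            out.append("")  # sentence separator
            continue
        *cols, tag = line.split()
        out.append(" ".join(cols + [biluo_to_bio(tag)]))
    return "\n".join(out) + "\n"


biluo_text = "George B-PER\nWashington L-PER\nwent O\n\nUber U-ORG\n"
bio_text = convert_column_text(biluo_text)
```

Because only the last column is touched, the same helper works for files with or without a POS column.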
