
How to convert spaCy NER dataset format to Flair format?

I have already labeled a dataset using Dataturks to train a spaCy NER model, and everything works fine. However, I just realized that Flair uses a different format, and I am wondering whether there is a way to convert my spaCy NER JSON dataset into the Flair format:

George N B-PER
Washington N I-PER
went V O
to P O
Washington N B-LOC

However, spaCy's format looks as follows:

[("George Washington went to Washington",
{'entities': [(0, 6,'PER'),(7, 17,'PER'),(26, 36,'LOC')]})]

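The example below writes one token and its BILUO tag per line, with an empty line between sentences, so you need biluo_tags_from_offsets (spaCy v2; in spaCy v3 the function is spacy.training.offsets_to_biluo_tags):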

import spacy
from spacy.gold import biluo_tags_from_offsets
nlp = spacy.load("en_core_web_md")

ents = [("George Washington went to Washington",{'entities': [(0, 6,'PER'),(7, 17,'PER'),(26, 36,'LOC')]}),
         ("Uber blew through $1 million a week", {'entities':[(0, 4, 'ORG')]}),
       ]

with open("flair_ner.txt","w") as f:
    for sent,tags in ents:
        doc = nlp(sent)
        biluo = biluo_tags_from_offsets(doc,tags['entities'])
        for word,tag in zip(doc, biluo):
            f.write(f"{word} {tag}\n")
        f.write("\n")

Output:

George U-PER
Washington U-PER
went O
to O
Washington U-LOC

Uber U-ORG
blew O
through O
$ O
1 O
million O
a O
week O
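
The target format in the question uses B-/I- prefixes (BIO) rather than BILUO. If BIO tags are wanted, the BILUO tags can be remapped before writing. A minimal sketch, assuming the tags are plain strings such as "U-PER" or "O"; the helper name biluo_to_bio is made up for this illustration and is not part of spaCy or Flair:

```python
def biluo_to_bio(tags):
    """Map BILUO tags to BIO: L- (last) becomes I-, U- (unit) becomes B-."""
    prefix_map = {"L": "I", "U": "B"}
    bio = []
    for tag in tags:
        prefix, sep, label = tag.partition("-")
        if sep and prefix in prefix_map:
            bio.append(f"{prefix_map[prefix]}-{label}")
        else:
            bio.append(tag)  # B-, I- and O are identical in both schemes
    return bio


print(biluo_to_bio(["B-PER", "L-PER", "O", "O", "U-LOC"]))
# ['B-PER', 'I-PER', 'O', 'O', 'B-LOC']
```

The function can be applied to the list returned by biluo_tags_from_offsets before the tags are zipped with the tokens.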

Note: for training NER alone, this seems to be enough. If you also want POS tags, you need to create a mapping from the Universal POS tags to Flair's simplified scheme. For example:

tag_mapping = {'PROPN':'N','VERB':'V','ADP':'P','NOUN':'N'} # create your own
with open("flair_ner.txt","w") as f:
    for pair in ents:
        sent,tags = pair
        doc = nlp(sent)
        biluo = biluo_tags_from_offsets(doc,tags['entities'])
        try:
            for word,tag in zip(doc, biluo):
                f.write(f"{word} {tag_mapping[word.pos_]} {tag}\n")
                # alternative without try/except:
                # f.write(f"{word} {tag_mapping.get(word.pos_,'None')} {tag}\n")
        except KeyError:
            # the remaining tokens of this sentence are skipped once an unmapped tag is hit
            print(f"'{word.pos_}' tag is not defined in tag_mapping")
        f.write("\n")

Output:

'SYM' tag is not defined in tag_mapping
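
Because the KeyError is caught outside the inner loop, a sentence containing an unmapped POS tag is written only up to that token. One alternative is to look the tag up with dict.get and a placeholder so every token is still written. This is only a sketch: format_row and the "X" placeholder are hypothetical names chosen for illustration, and plain (text, POS, NER) triples stand in for spaCy tokens so that it runs without a model.

```python
TAG_MAPPING = {"PROPN": "N", "VERB": "V", "ADP": "P", "NOUN": "N"}  # extend as needed


def format_row(text, upos, ner_tag, fallback="X"):
    """Build one 'token POS NER' line, using a fallback for unmapped POS tags."""
    return f"{text} {TAG_MAPPING.get(upos, fallback)} {ner_tag}"


# (text, universal POS, NER tag) triples standing in for spaCy tokens
tokens = [("Uber", "PROPN", "U-ORG"), ("blew", "VERB", "O"), ("$", "SYM", "O")]
rows = [format_row(*tok) for tok in tokens]
print("\n".join(rows))
```

With real tokens the call would be format_row(word.text, word.pos_, tag) inside the loop.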

The main data format in spaCy v3.0 is a binary format with the extension .spacy; the JSON format is deprecated. To convert a train.spacy file with BILUO annotations to the Flair format, I loaded it as a Corpus:

import spacy
from spacy.training import Corpus

nlp = spacy.load("de_core_news_sm")
corpus = Corpus("route/to/train.spacy")

data = corpus(nlp)

# Flair supports BIO and BIOES, see https://github.com/flairNLP/flair/issues/875
def rename_biluo_to_bioes(old_tag):
    # BILUO and BIOES differ only in two prefixes:
    # L- (last) becomes E- (end) and U- (unit) becomes S- (single).
    if old_tag.startswith("L-"):
        return "E" + old_tag[1:]
    if old_tag.startswith("U-"):
        return "S" + old_tag[1:]
    return old_tag


def generate_corpus():
    corpus = []
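    for example in data:
        text = example.text
        # Re-run the pipeline on the raw text to get POS tags for each token.
        doc = nlp(text)
        tags = example.get_aligned_ner()
        # None marks a token whose NER annotation is missing or could not be aligned;
        # skip such examples entirely.
        if None in tags:
            pass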
        else:
            new_tags = [rename_biluo_to_bioes(tag) for tag in tags]
            for token, tag in zip(doc,new_tags):
                row = token.text +' '+ token.pos_ +' ' +tag + '\n'
                corpus.append(row)
            corpus.append('\n')
    return corpus

def write_file(filepath):
    with open(filepath, 'w', encoding='utf-8') as f:
        corpus = generate_corpus()
        f.writelines(corpus)
        
def main():
    write_file('./data/train.txt')

if __name__ == '__main__':
    main()   
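
Before training, it can help to sanity-check the generated file. The sketch below reads a column file back into sentences and checks that every token line has three columns and a BIOES-style tag. The function name read_column_file and the sample data are assumptions for illustration; it uses only the standard library, not Flair.

```python
import io

VALID_PREFIXES = {"B", "I", "E", "S"}


def read_column_file(handle, n_columns=3):
    """Parse 'token POS NER' lines into sentences; blank lines separate sentences."""
    sentences, current = [], []
    for line_no, line in enumerate(handle, start=1):
        line = line.strip()
        if not line:
            if current:
                sentences.append(current)
                current = []
            continue
        columns = line.split()
        if len(columns) != n_columns:
            raise ValueError(f"line {line_no}: expected {n_columns} columns, got {len(columns)}")
        tag = columns[-1]
        if tag != "O" and tag.split("-", 1)[0] not in VALID_PREFIXES:
            raise ValueError(f"line {line_no}: unexpected tag {tag!r}")
        current.append(tuple(columns))
    if current:
        sentences.append(current)
    return sentences


sample = io.StringIO(
    "Uber PROPN S-ORG\nblew VERB O\n\nGeorge PROPN B-PER\nWashington PROPN E-PER\n"
)
sentences = read_column_file(sample)
print(len(sentences), sentences[0][0])
```

For the real file, pass open('./data/train.txt', encoding='utf-8') instead of the in-memory sample. A column-count error usually points to a token that itself contains whitespace.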

I hope it works; I wrote this some time ago.
