I have already labeled a dataset using Dataturks to train a spaCy
NER model, and everything works fine. However, I just realized that Flair
uses a different format, and I am wondering whether there is a way to convert my spaCy NER JSON dataset into the Flair
format:
George N B-PER
Washington N I-PER
went V O
to P O
Washington N B-LOC
However, spaCy's format looks as follows:
[("George Washington went to Washington",
{'entities': [(0, 6,'PER'),(7, 17,'PER'),(26, 36,'LOC')]})]
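As a quick sanity check of the format above, each entity is a `(start, end, label)` triple of character offsets with an exclusive end index. The following minimal sketch slices the text with those offsets; `entity_spans` is a hypothetical helper name, not part of spaCy.

```python
# Character offsets are (start, end, label), with an exclusive end index.
text = "George Washington went to Washington"
entities = [(0, 6, "PER"), (7, 17, "PER"), (26, 36, "LOC")]


def entity_spans(text, entities):
    """Return (surface_text, label) for each (start, end, label) offset."""
    return [(text[start:end], label) for start, end, label in entities]


spans = entity_spans(text, entities)
for surface, label in spans:
    print(surface, label)
```

Note that "George" and "Washington" are annotated as two separate PER spans here, not as one span covering the full name.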
Flair uses the BILUO scheme, with an empty line between sentences, so you would need to use biluo_tags_from_offsets:
import spacy
# spaCy v2 import; in spaCy v3 the equivalent is spacy.training.offsets_to_biluo_tags
from spacy.gold import biluo_tags_from_offsets

nlp = spacy.load("en_core_web_md")
ents = [
    ("George Washington went to Washington", {'entities': [(0, 6, 'PER'), (7, 17, 'PER'), (26, 36, 'LOC')]}),
    ("Uber blew through $1 million a week", {'entities': [(0, 4, 'ORG')]}),
]
with open("flair_ner.txt", "w") as f:
    for sent, tags in ents:
        doc = nlp(sent)
        # One BILUO tag per token, derived from the character offsets
        biluo = biluo_tags_from_offsets(doc, tags['entities'])
        for word, tag in zip(doc, biluo):
            f.write(f"{word} {tag}\n")
        # Empty line marks the end of the sentence
        f.write("\n")
Output:
George U-PER
Washington U-PER
went O
to O
Washington U-LOC

Uber U-ORG
blew O
through O
$ O
1 O
million O
a O
week O
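The U-PER tags in the output appear because "George" and "Washington" are annotated as two separate spans. If the data instead marked the full name as one span, a B-/I- sequence like the one in the question would result. The sketch below illustrates that without spaCy. It is a simplified stand-in, not the library's method: it assumes whitespace tokenization (so "$1" would stay one token, unlike spaCy's tokenizer), assumes spans start on token boundaries, and uses the hypothetical helper name `offsets_to_bio`.

```python
import re


def offsets_to_bio(text, entities):
    """Tag whitespace-separated tokens with BIO labels from character offsets.

    Assumptions: tokens are split on whitespace only, and every entity span
    starts and ends on a token boundary.
    """
    rows = []
    for match in re.finditer(r"\S+", text):
        tag = "O"
        for start, end, label in entities:
            if match.start() >= start and match.end() <= end:
                # The token at the span start gets B-, later tokens get I-.
                prefix = "B" if match.start() == start else "I"
                tag = f"{prefix}-{label}"
                break
        rows.append((match.group(), tag))
    return rows


# Hypothetical variant of the example in which the full name is one entity.
sample_text = "George Washington went to Washington"
sample_entities = [(0, 17, "PER"), (26, 36, "LOC")]
rows = offsets_to_bio(sample_text, sample_entities)
for token, tag in rows:
    print(token, tag)
```

For real data, it is safer to keep using spaCy's tokenizer so that tokens and tags stay aligned with the model you train against.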
Note that this seems to be enough to train NER alone. If you wish to add POS tags as well, you would need to create a mapping from the Universal POS tags to Flair's simplified scheme. For example:
tag_mapping = {'PROPN': 'N', 'VERB': 'V', 'ADP': 'P', 'NOUN': 'N'}  # create your own
with open("flair_ner.txt", "w") as f:
    for pair in ents:
        sent, tags = pair
        doc = nlp(sent)
        biluo = biluo_tags_from_offsets(doc, tags['entities'])
        try:
            for word, tag in zip(doc, biluo):
                f.write(f"{word} {tag_mapping[word.pos_]} {tag}\n")
                # Alternative: write a placeholder instead of stopping at an unmapped tag
                # f.write(f"{word} {tag_mapping.get(word.pos_,'None')} {tag}\n")
        except KeyError:
            # Reports the first POS tag in this sentence that has no mapping
            print(f"'{word.pos_}' tag is not defined in tag_mapping")
        f.write("\n")
Output:
'SYM' tag is not defined in tag_mapping
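Because the loop stops at the first unmapped tag in a sentence, it may be easier to list every missing POS tag up front and extend tag_mapping before writing the file. A minimal sketch of that idea follows; `missing_pos_tags` is a hypothetical helper, and the (token, POS) pairs are illustrative values standing in for what `token.pos_` would return.

```python
def missing_pos_tags(tagged_tokens, tag_mapping):
    """Return the sorted POS tags that have no entry in tag_mapping."""
    return sorted({pos for _, pos in tagged_tokens if pos not in tag_mapping})


tag_mapping = {"PROPN": "N", "VERB": "V", "ADP": "P", "NOUN": "N"}

# Hypothetical (token, POS) pairs; real values would come from token.pos_.
tagged = [
    ("Uber", "PROPN"), ("blew", "VERB"), ("through", "ADP"),
    ("$", "SYM"), ("1", "NUM"), ("million", "NUM"),
    ("a", "DET"), ("week", "NOUN"),
]

missing = missing_pos_tags(tagged, tag_mapping)
print(missing)
```

Each tag reported this way can then be added to tag_mapping, or handled with a default through `tag_mapping.get`.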
The main data format used in spaCy v3.0 is a binary format with the extension .spacy; the JSON format is deprecated. To convert a train.spacy file with BILUO annotations to the Flair format, I created a corpus:
import spacy
from spacy.training import Corpus

nlp = spacy.load("de_core_news_sm")
corpus = Corpus("route/to/train.spacy")
data = corpus(nlp)


# Flair supports BIO and BIOES, see https://github.com/flairNLP/flair/issues/875
def rename_biluo_to_bioes(old_tag):
    new_tag = ""
    try:
        if old_tag.startswith("L"):
            new_tag = "E" + old_tag[1:]
        elif old_tag.startswith("U"):
            new_tag = "S" + old_tag[1:]
        else:
            new_tag = old_tag
    except AttributeError:
        # old_tag is not a string (e.g. None); fall back to an empty tag
        pass
    return new_tag


def generate_corpus():
    corpus = []
    n_ex = 0
    for example in data:
        n_ex += 1
        text = example.text
        # Run the full pipeline so that token.pos_ is available
        doc = nlp(text)
        tags = example.get_aligned_ner()
        # Skip examples in which some token has no NER tag (None)
        if None in tags:
            continue
        new_tags = [rename_biluo_to_bioes(tag) for tag in tags]
        for token, tag in zip(doc, new_tags):
            row = token.text + ' ' + token.pos_ + ' ' + tag + '\n'
            corpus.append(row)
        # Empty line marks the end of the sentence
        corpus.append('\n')
    return corpus


def write_file(filepath):
    with open(filepath, 'w', encoding='utf-8') as f:
        corpus = generate_corpus()
        f.writelines(corpus)


def main():
    write_file('./data/train.txt')


if __name__ == '__main__':
    main()
I hope it works; it has been a while since I wrote it, though.
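Since the comment in the code notes that Flair supports BIO as well as BIOES, the same renaming idea could target BIO instead. A minimal sketch, with `biluo_to_bio` as a hypothetical helper name and the tag list made up for illustration:

```python
def biluo_to_bio(tag):
    """Map a BILUO tag to BIO: L- becomes I-, U- becomes B-, others unchanged."""
    if tag.startswith("L-"):
        return "I-" + tag[2:]
    if tag.startswith("U-"):
        return "B-" + tag[2:]
    return tag


# Illustrative BILUO sequence for a five-token sentence.
biluo = ["B-PER", "L-PER", "O", "O", "U-LOC"]
bio = [biluo_to_bio(t) for t in biluo]
print(bio)
```

Checking for the prefix with its hyphen ("L-", "U-") avoids accidentally rewriting a bare label that happens to begin with the same letter.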