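How to convert spaCy NER dataset format to Flair format?
I have labeled a dataset with dataturks to train spaCy NER, and everything works fine. However, I just realized that Flair uses a different format, and I am wondering whether there is a way to convert my spaCy NER JSON dataset into the Flair format: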
George N B-PER
Washington N I-PER
went V O
to P O
Washington N B-LOC
However, spaCy's format looks like this:
[("George Washington went to Washington",
  {'entities': [(0, 6, 'PER'), (7, 17, 'PER'), (26, 36, 'LOC')]})]
Flair uses the BILUO scheme, with an empty line between sentences, so you need to use biluo_tags_from_offsets:
import spacy
# spaCy v2 import; in spaCy v3 the equivalent is spacy.training.offsets_to_biluo_tags
from spacy.gold import biluo_tags_from_offsets

nlp = spacy.load("en_core_web_md")

ents = [
    ("George Washington went to Washington", {'entities': [(0, 6, 'PER'), (7, 17, 'PER'), (26, 36, 'LOC')]}),
    ("Uber blew through $1 million a week", {'entities': [(0, 4, 'ORG')]}),
]

with open("flair_ner.txt", "w") as f:
    for sent, tags in ents:
        doc = nlp(sent)
        # One BILUO tag per token, derived from the character offsets
        biluo = biluo_tags_from_offsets(doc, tags['entities'])
        for word, tag in zip(doc, biluo):
            f.write(f"{word} {tag}\n")
        # An empty line marks the sentence boundary
        f.write("\n")
Output:
George U-PER
Washington U-PER
went O
to O
Washington U-LOC

Uber U-ORG
blew O
through O
$ O
1 O
million O
a O
week O
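Note that "George" and "Washington" each come out as U-PER because the offsets list them as two separate spans. To make the offset-to-tag step concrete without spaCy, here is a minimal sketch that assumes plain whitespace tokenization. The helper names `whitespace_tokens` and `offsets_to_biluo` are made up for this illustration and are not part of spaCy or Flair. Unlike spaCy, it silently ignores any part of a span that does not line up with token boundaries.

```python
def whitespace_tokens(text):
    """Yield (token, start, end) for each whitespace-separated token."""
    pos = 0
    for tok in text.split():
        start = text.index(tok, pos)
        end = start + len(tok)
        pos = end
        yield tok, start, end


def offsets_to_biluo(text, entities):
    """Turn (start, end, label) character spans into one BILUO tag per token."""
    tokens = list(whitespace_tokens(text))
    tags = ["O"] * len(tokens)
    for ent_start, ent_end, label in entities:
        # Indices of the tokens that lie fully inside the entity span
        inside = [i for i, (_, s, e) in enumerate(tokens)
                  if s >= ent_start and e <= ent_end]
        if not inside:
            continue
        if len(inside) == 1:
            tags[inside[0]] = f"U-{label}"
        else:
            tags[inside[0]] = f"B-{label}"
            for i in inside[1:-1]:
                tags[i] = f"I-{label}"
            tags[inside[-1]] = f"L-{label}"
    return [(tok, tag) for (tok, _, _), tag in zip(tokens, tags)]


text = "George Washington went to Washington"
# Two separate PER spans, as in the question: each name becomes a unit entity
separate = offsets_to_biluo(text, [(0, 6, "PER"), (7, 17, "PER"), (26, 36, "LOC")])
# One PER span covering both tokens: the name becomes a B-/L- pair
merged = offsets_to_biluo(text, [(0, 17, "PER"), (26, 36, "LOC")])
```

If the full name should be a single entity, annotating it as one span such as (0, 17, 'PER') is what produces a multi-token tag sequence.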
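Note that training on the NER tags alone seems to be enough. If you also want to add POS tags, you need to create a mapping from Universal POS tags to Flair's simplified scheme. For example: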
tag_mapping = {'PROPN': 'N', 'VERB': 'V', 'ADP': 'P', 'NOUN': 'N'}  # create your own

with open("flair_ner.txt", "w") as f:
    for pair in ents:
        sent, tags = pair
        doc = nlp(sent)
        biluo = biluo_tags_from_offsets(doc, tags['entities'])
        try:
            for word, tag in zip(doc, biluo):
                f.write(f"{word} {tag_mapping[word.pos_]} {tag}\n")
                # f.write(f"{word} {tag_mapping.get(word.pos_,'None')} {tag}\n")
        except KeyError:
            # Raised for the first token whose POS tag is missing from tag_mapping
            print(f"'{word.pos_}' tag is not defined in tag_mapping")
        f.write("\n")
Output:
'SYM' tag is not defined in tag_mapping
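With the try/except version, a sentence is cut off at the first unmapped tag. A possible alternative, sketched below under my own assumptions, is to fall back to a placeholder and collect the unmapped tags so the mapping can be extended afterwards. The function `format_rows` and the fallback value "X" are hypothetical choices for this illustration, not something Flair prescribes. The input is a plain list of (text, universal POS, NER tag) tuples, so the sketch runs without spaCy.

```python
TAG_MAPPING = {"PROPN": "N", "VERB": "V", "ADP": "P", "NOUN": "N"}


def format_rows(tokens, mapping=TAG_MAPPING, fallback="X"):
    """Build 'text pos ner' rows and report POS tags missing from the mapping.

    tokens: iterable of (text, universal_pos, ner_tag) tuples.
    Returns (rows, unmapped), where unmapped is the set of unknown POS tags.
    """
    rows = []
    unmapped = set()
    for text, pos, ner in tokens:
        if pos not in mapping:
            unmapped.add(pos)
        # Unknown POS tags get the fallback instead of raising KeyError
        rows.append(f"{text} {mapping.get(pos, fallback)} {ner}")
    return rows, unmapped


rows, unmapped = format_rows([("Uber", "PROPN", "U-ORG"), ("$", "SYM", "O")])
```

Here `unmapped` ends up containing "SYM", which tells you which entries still need to be added to the mapping.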
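The main data format used in spaCy v3.0 is a binary format with the extension .spacy. The JSON format is deprecated. To convert a train.spacy file with BILUO annotations into the Flair format, I built a corpus: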
import spacy
from spacy.training import Corpus

nlp = spacy.load("de_core_news_sm")
corpus = Corpus("route/to/train.spacy")
data = corpus(nlp)


# Flair supports BIO and BIOES, see https://github.com/flairNLP/flair/issues/875
def rename_biluo_to_bioes(old_tag):
    """Map BILUO prefixes to BIOES: L- becomes E-, U- becomes S-."""
    if old_tag.startswith("L-"):
        return "E" + old_tag[1:]
    if old_tag.startswith("U-"):
        return "S" + old_tag[1:]
    return old_tag


def generate_corpus():
    corpus = []
    for example in data:
        text = example.text
        # Run the full pipeline to get POS tags for each token
        doc = nlp(text)
        tags = example.get_aligned_ner()
        # Skip examples where some token has no usable NER tag (None)
        if None in tags:
            continue
        new_tags = [rename_biluo_to_bioes(tag) for tag in tags]
        for token, tag in zip(doc, new_tags):
            row = token.text + ' ' + token.pos_ + ' ' + tag + '\n'
            corpus.append(row)
        # An empty line separates sentences
        corpus.append('\n')
    return corpus


def write_file(filepath):
    with open(filepath, 'w', encoding='utf-8') as f:
        corpus = generate_corpus()
        f.writelines(corpus)


def main():
    write_file('./data/train.txt')


if __name__ == '__main__':
    main()
I hope it works. It has been a while, though.
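Since the linked issue says Flair also accepts BIO, the same renaming idea can target the B-/I- tags shown in the question instead of BIOES. A minimal sketch, assuming the input is a list of well-formed BILUO tag strings; `biluo_to_bio` is a name I made up for this example.

```python
def biluo_to_bio(tags):
    """Convert BILUO tags to BIO: L- becomes I-, U- becomes B-."""
    converted = []
    for tag in tags:
        if tag.startswith("L-"):
            converted.append("I-" + tag[2:])
        elif tag.startswith("U-"):
            converted.append("B-" + tag[2:])
        else:
            # B-, I- and O are identical in both schemes
            converted.append(tag)
    return converted


bio = biluo_to_bio(["B-PER", "L-PER", "O", "O", "U-LOC"])
```

For the tags above this yields B-PER, I-PER, O, O, B-LOC, which matches the layout requested in the question.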