Converting .tsv format to spaCy for NER
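I'm stuck on a problem and I'm not very experienced at coding. I have a tsv file in which the data looks like this:
Sentences are separated by a blank line. I tried using this: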
def load_data_spacy(file_path):
    ''' Converts data from:
    word \t label \n word \t label \n \n word \t label
    to: sentence, {entities : [(start, end, label), (start, end, label)]}
    '''
    file = open(file_path, 'r')
    training_data, entities, sentence, unique_labels = [], [], [], []
    current_annotation = None
    start = 0
    end = 0  # initialize counter to keep track of start and end characters
    for line in file:
        line = line.strip("\n").split("\t")
        # lines with len > 1 are words
        if len(line) > 1:
            label = line[1]
            if label != 'O':
                label = line[1]  # the .txt is formatted: label \t word, label[0:2] = label_type
            # label_type = line[0][0]  # beginning of annotations - "B", intermediate - "I"
            word = line[0]
            sentence.append(word)
            start = end
            end += (len(word) + 1)  # length of the word + trailing space
        # lines with len == 1 are breaks between sentences
        if len(line) == 1:
            if len(entities) > 0:
                sentence = " ".join(sentence)
                training_data.append([sentence, {'entities': entities}])
            # reset the counters and temporary lists
            end = 0
            start = 0
            entities, sentence = [], []
    file.close()
    return training_data, unique_labels
But I can't get the spaCy format required for NER, which should look like this:
[["sentence", {'entities': [(start, end, 'tags')]}]]
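You don't have to write custom code for this. This is one of the formats spaCy can convert directly with the spacy convert command. It's the conll/ner format, so you can do this: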
spacy convert -c ner myfile.tsv out.spacy
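Note that as of spaCy v3 there is no specific recommended JSON format; you only need to create Docs that look like your desired output. See the example projects for code that converts various kinds of data.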
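If you still want the character-offset tuples the question asks for, the bookkeeping could look roughly like the sketch below. In the question's function nothing is ever appended to entities, so no sentence reaches training_data. The sketch assumes the label column uses BIO tags (B-XXX, I-XXX, O), which the question does not show. The function name tsv_lines_to_offsets and the sample data are hypothetical, and unlike the question's code it also keeps sentences that contain no entities.

```python
def tsv_lines_to_offsets(lines):
    """Turn 'word<TAB>label' lines (blank line = sentence break) into
    [sentence, {'entities': [(start, end, label), ...]}] records.

    Assumption: labels are BIO tags such as 'B-PER', 'I-PER', 'O'.
    """
    training_data = []
    words, entities = [], []
    position = 0        # character offset of the next word in the joined sentence
    open_entity = None  # [start, end, label] of the entity currently being extended

    def close_entity():
        nonlocal open_entity
        if open_entity is not None:
            entities.append(tuple(open_entity))
            open_entity = None

    def flush_sentence():
        nonlocal words, entities, position
        close_entity()
        if words:
            training_data.append([" ".join(words), {"entities": entities}])
        words, entities, position = [], [], 0

    for raw in lines:
        parts = raw.rstrip("\n").split("\t")
        if len(parts) < 2:  # blank line: sentence boundary
            flush_sentence()
            continue
        word, tag = parts[0], parts[1]
        start, end = position, position + len(word)
        prefix, _, label = tag.partition("-")
        if prefix == "I" and open_entity is not None and open_entity[2] == label:
            open_entity[1] = end  # extend the running entity over this word
        else:
            close_entity()
            if prefix in ("B", "I"):  # a stray I- tag starts a new entity
                open_entity = [start, end, label]
        words.append(word)
        position = end + 1  # +1 for the space added by " ".join
    flush_sentence()  # the input may not end with a blank line
    return training_data


# Hypothetical sample input; with a real file, pass the open file object instead.
sample = ["John\tB-PER\n", "Smith\tI-PER\n", "lives\tO\n", "in\tO\n",
          "Paris\tB-LOC\n", "\n", "Hello\tO\n"]
for sentence, annotations in tsv_lines_to_offsets(sample):
    print(sentence, annotations)
```

The end offset excludes the joining space, so sentence[start:end] gives back exactly the entity text. For actual training in spaCy v3, the spacy convert route above is still the simpler option.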