Converting .tsv format to spaCy for NER
I am facing a problem and I'm not that good at coding. I have a TSV file where the data looks like this:
Sentences are separated by a blank line. I have tried using this:
def load_data_spacy(file_path):
    ''' Converts data from:
    word \t label \n word \t label \n \n word \t label
    to: sentence, {entities : [(start, end, label), (start, end, label)]}
    '''
    file = open(file_path, 'r')
    training_data, entities, sentence, unique_labels = [], [], [], []
    current_annotation = None
    start = 0
    end = 0  # initialize counter to keep track of start and end characters
    for line in file:
        line = line.strip("\n").split("\t")
        # lines with len > 1 are words
        if len(line) > 1:
            label = line[1]
            if label != 'O':
                label = line[1]  # the .txt is formatted: label \t word, label[0:2] = label_type
            # label_type = line[0][0]  # beginning of annotations - "B", intermediate - "I"
            word = line[0]
            sentence.append(word)
            start = end
            end += (len(word) + 1)  # length of the word + trailing space
        # lines with len == 1 are breaks between sentences
        if len(line) == 1:
            if len(entities) > 0:
                sentence = " ".join(sentence)
                training_data.append([sentence, {'entities': entities}])
            # reset the counters and temporary lists
            end = 0
            start = 0
            entities, sentence = [], []
    file.close()
    return training_data, unique_labels
But I am unable to get the required spaCy format for NER, which should look like this:
[["sentence", {'entities': [(start, end, 'tags')]}]]
You don't have to write custom code for this. This is one of the formats spaCy can convert directly using the spacy convert command. It's the conll/ner format, so you can just do this:
spacy convert -c ner myfile.tsv out.spacy
Note that as of v3, spaCy doesn't have a specific recommended JSON format; you just need to make Docs that look like the output you want. Take a look at the example projects to see code converting various types of data.
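If you still want the tuple-style list from the question (for example, to inspect the offsets before building Docs), here is a minimal sketch of one way to produce it. It is not part of spaCy. The function name tsv_to_offsets is made up for illustration, and it assumes the labels are BIO tags such as B-PER / I-PER with O for non-entities, which the question does not actually show. Words are joined with a single space, so the offsets are only valid for that joined string. Unlike the code in the question, it appends an entity span whenever a tagged word is seen, and it also keeps sentences that contain no entities.

```python
def tsv_to_offsets(lines):
    """Turn 'word<TAB>tag' lines (blank line between sentences) into
    [sentence, {'entities': [(start, end, label), ...]}] items.
    Assumes BIO-style tags (B-XXX / I-XXX / O); this is an assumption."""
    training_data = []
    words, tags = [], []

    def flush():
        # Build one training example from the words collected so far.
        if not words:
            return
        entities = []
        current = None  # [start, end, label] of the entity being extended
        pos = 0         # character offset of the current word
        for word, tag in zip(words, tags):
            start, end = pos, pos + len(word)
            if tag == "O":
                if current:
                    entities.append(tuple(current))
                current = None
            else:
                label = tag[2:] if tag[:2] in ("B-", "I-") else tag
                if current and not tag.startswith("B-") and label == current[2]:
                    current[1] = end          # extend the open entity
                else:
                    if current:
                        entities.append(tuple(current))
                    current = [start, end, label]  # open a new entity
            pos = end + 1  # +1 for the space added by " ".join
        if current:
            entities.append(tuple(current))
        training_data.append([" ".join(words), {"entities": entities}])
        words.clear()
        tags.clear()

    for raw in lines:
        parts = raw.rstrip("\n").split("\t")
        if len(parts) < 2:   # blank (or malformed) line ends the sentence
            flush()
            continue
        words.append(parts[0])
        tags.append(parts[1])
    flush()                  # last sentence may not be followed by a blank line
    return training_data


sample = [
    "John\tB-PER\n", "Smith\tI-PER\n", "lives\tO\n", "in\tO\n", "Paris\tB-LOC\n",
    "\n",
    "Hello\tO\n",
]
data = tsv_to_offsets(sample)
print(data[0])
# ['John Smith lives in Paris', {'entities': [(0, 10, 'PER'), (20, 25, 'LOC')]}]
```

To run it on a file, pass the open file object, e.g. `with open("myfile.tsv", encoding="utf-8") as f: data = tsv_to_offsets(f)`. For training with spaCy v3 itself, the spacy convert route above is still the simpler option.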