
Converting .tsv format to spaCy for NER

I am facing a problem and I am not that experienced in coding. I have a TSV file in which each line holds a word and its label, separated by a tab, and sentences are separated by a blank line. I have tried using this:

def load_data_spacy(file_path):
    ''' Converts data from:
    word \t label \n word \t label \n \n word \t label
    to: sentence, {entities : [(start, end, label), (start, end, label)]}
    '''
    file = open(file_path, 'r')
    training_data, entities, sentence, unique_labels = [], [], [], []
    current_annotation = None
    start = 0
    end = 0  # initialize counter to keep track of start and end characters
    for line in file:
        line = line.strip("\n").split("\t")
        # lines with len > 1 are words
        if len(line) > 1:
            label = line[1]
            if(label != 'O'):
                label = line[1]     # the .txt is formatted: label \t word, label[0:2] = label_type
            #label_type = line[0][0] # beginning of annotations - "B", intermediate - "I"
            word = line[0]
            sentence.append(word)
            start = end
            end += (len(word) + 1)  # length of the word + trailing space
        # lines with len == 1 are breaks between sentences
        if len(line) == 1:
            if(len(entities) > 0):
                sentence = " ".join(sentence)
                training_data.append([sentence, {'entities' : entities}])
            # reset the counters and temporary lists
            end = 0
            start = 0
            entities, sentence = [], []

    file.close()
    return training_data, unique_labels

But I am unable to get the required spaCy format for NER, which should look like this:

[["sentence", {'entities': [(start, end, 'tags')]}]]
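One likely reason the loader above never produces that structure is that nothing is ever appended to `entities`, so the `len(entities) > 0` check always fails and every sentence is dropped. The following is only a minimal sketch of a loader that does record spans, under a few assumptions that are not in the original post: it takes an iterable of lines rather than a file path, it keeps sentences that have no entities, and it emits one span per labelled token with the label copied as-is (it does not merge `B-`/`I-` tokens into a single span). The sample data is made up for illustration.

```python
def load_data_spacy(lines):
    """Convert 'word<TAB>label' lines (blank line between sentences) into
    [[sentence, {'entities': [(start, end, label), ...]}], ...].
    """
    training_data, words, entities = [], [], []
    offset = 0  # character offset of the next word in the joined sentence

    def flush():
        # Store the sentence collected so far and reset the buffers.
        nonlocal words, entities, offset
        if words:
            training_data.append([" ".join(words), {"entities": entities}])
        words, entities, offset = [], [], 0

    for raw in lines:
        parts = raw.rstrip("\r\n").split("\t")
        if len(parts) < 2:  # blank line: sentence boundary
            flush()
            continue
        word, label = parts[0], parts[1]
        start, end = offset, offset + len(word)
        if label != "O":
            # Assumption: one span per labelled token, label kept verbatim.
            entities.append((start, end, label))
        words.append(word)
        offset = end + 1  # +1 for the space added by " ".join
    flush()  # the last sentence may not be followed by a blank line
    return training_data


# Made-up sample data, not taken from the original file.
sample = "John\tB-PER\nlives\tO\nin\tO\nParis\tB-LOC\n\nHello\tO\n"
result = load_data_spacy(sample.splitlines())
print(result)
```

To read from a file instead, the same function can be called with an open file object, since iterating over a file also yields lines.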

You don't have to write custom code for this. This is one of the formats spaCy can convert directly using the spacy convert command. It's the conll/ner format, so you can just do this:

spacy convert -c ner myfile.tsv out.spacy

Note that as of v3, spaCy doesn't have a specific recommended JSON format; you just need to make Docs that look like the output you want. Take a look at the example projects to see code converting various types of data.
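As a rough illustration of what "making Docs" can look like in v3, here is a minimal sketch that builds `Doc` objects straight from word/tag pairs and collects them in a `DocBin`. It assumes the labels in the file are IOB-style tags such as `B-PER`, `I-PER` and `O` (the original post does not confirm the tag scheme), and the helper name `docs_from_tsv_lines`, the sample data and the output file name are hypothetical.

```python
import spacy
from spacy.tokens import Doc, DocBin


def docs_from_tsv_lines(lines, nlp):
    """Yield one Doc per blank-line-separated sentence of word<TAB>tag lines."""
    words, tags = [], []
    for raw in lines:
        parts = raw.rstrip("\r\n").split("\t")
        if len(parts) < 2:  # blank line: sentence boundary
            if words:
                # Assumption: tags are IOB strings, which Doc accepts via `ents`.
                yield Doc(nlp.vocab, words=words, ents=tags)
            words, tags = [], []
            continue
        words.append(parts[0])
        tags.append(parts[1])
    if words:  # last sentence without a trailing blank line
        yield Doc(nlp.vocab, words=words, ents=tags)


nlp = spacy.blank("en")  # blank pipeline, no trained model required

# Made-up sample data, not taken from the original file.
sample = ["John\tB-PER", "lives\tO", "in\tO", "Paris\tB-LOC", "", "Hello\tO"]
docs = list(docs_from_tsv_lines(sample, nlp))

doc_bin = DocBin(docs=docs)
# doc_bin.to_disk("train.spacy")  # hypothetical output path

for ent in docs[0].ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)
```

For a file that already matches the conll/ner layout, the convert command is still the shorter route; building Docs by hand is mainly useful when the data needs custom handling.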

