
Filtering data based on NER entity labels in spacy training data

I have NER training data, using Spacy, in the following format.

[('Christmas Perot 2021 TSO\nSkip to Main Content HOME CONCERTS EVENTS ABOUT STAFF EDUCATION SUPPORT US More Use tab to navigate through the menu items. BUY TICKETS SUNDAY, DECEMBER 12, 2021 I PEROT THEATRE I 4:00 PM\nPOPS I Christmas at The Perot\nCLICK HERE to purchase tickets, or contact the Texarkana Symphony Orchestra at 870.773.3401\nA Texarkana Tradition Join the TSO, the Texarkana Jazz Orchestra, and the TSO Chamber Singers, for this holiday concert for the whole family.\nDon’t miss seeing the winner of TSO’s 11th Annual Celebrity Conductor Competition\nBack to Events 2019 Texarkana Symphony Orchestra',
  {'entities': [(375, 399, 'organization'),
    (290, 318, 'organization'),
    (220, 242, 'production_name'),
    (169, 186, 'performance_date'),
    (189, 202, 'auditorium'),
    (205, 212, 'performance_starttime'),
    (409, 428, 'organization')]})]

Data is the first element in the tuple. Within entities, the numbers represent the character positions (start and end) of the entities in the data. Some lines do not have any entities. For example, the first line, Christmas Perot 2021 TSO, does not have any entities. I need to remove the sentences which do not have any entities. Splitting into sentences can be done based on the . and \n characters. I got the entity data based on the character offsets, but I didn't manage to remove the sentences which are not tagged.
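A minimal sketch of that filtering in plain Python (the text and offsets below are illustrative, not from the training data): walk the newline-split sentences, track the raw character offsets, and keep a sentence only if at least one entity span overlaps it.

```python
def keep_tagged_sentences(text, entities):
    """Keep only the sentences (split on newline) that overlap an entity span."""
    kept = []
    pos = 0
    for sent in text.split('\n'):
        start, end = pos, pos + len(sent)
        if any(s < end and e > start for s, e, _ in entities):
            kept.append(sent)
        pos = end + 1  # +1 for the '\n' removed by split()
    return kept

text = "no entities here\nACME Corp played on Friday"
ents = [(17, 26, 'organization')]  # 'ACME Corp'
print(keep_tagged_sentences(text, ents))  # → ['ACME Corp played on Friday']
```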

Code

from tqdm import tqdm
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")  # load a blank English pipeline
db = DocBin()  # create a DocBin object
for text, annot in tqdm(train_data):  # data in the format shown above
    doc = nlp.make_doc(text)  # create a Doc object from the raw text
    ents = []
    for start, end, label in annot["entities"]:  # character offsets
        span = doc.char_span(start, end, label=label, alignment_mode="contract")
        print(start, end, span, label)
        if span is None:
            print("Skipping entity")
        else:
            ents.append(span)
    doc.ents = ents  # label the text with the ents
    db.add(doc)  # store the doc in the DocBin

How about this:

#!/usr/bin/python
# -*- coding: utf-8 -*-
import numpy as np

foo = \
    [('''Christmas Perot 2021 TSO
Skip to Main Content HOME CONCERTS EVENTS ABOUT STAFF EDUCATION SUPPORT US More Use tab to navigate through the menu items. BUY TICKETS SUNDAY, DECEMBER 12, 2021 I PEROT THEATRE I 4:00 PM
POPS I Christmas at The Perot
CLICK HERE to purchase tickets, or contact the Texarkana Symphony Orchestra at 870.773.3401
A Texarkana Tradition Join the TSO, the Texarkana Jazz Orchestra, and the TSO Chamber Singers, for this holiday concert for the whole family.
Don’t miss seeing the winner of TSO’s 11th Annual Celebrity Conductor Competition
Back to Events 2019 Texarkana Symphony Orchestra''',
     {'entities': [
    (375, 399, 'organization'),
    (290, 318, 'organization'),
    (220, 242, 'production_name'),
    (169, 186, 'performance_date'),
    (189, 202, 'auditorium'),
    (205, 212, 'performance_starttime'),
    (409, 428, 'organization'),
    ]})]

sentences = foo[0][0].split('\n')
sentence_lengths = list(map(len, sentences))

cumulative_sentence_length = np.cumsum(sentence_lengths)

pick_indices = set()

for e in foo[0][1]['entities']:
    # only pick the first index (→ second [0])
    idx = np.where(e[0] < cumulative_sentence_length)[0][0]
    pick_indices.add(idx)

print('\n'.join([sentences[i] for i in pick_indices]))

The output is the second, third, fourth and fifth sentences (pick_indices = {1, 2, 3, 4}, zero-based). The idea is to

  1. split the sentences
  2. cumulate the sentence lengths
  3. check whether the entity start index is within range (and pick only the first matching index)
  4. (optional) do a sanity check yourself with the end index of the entity
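The four steps above can be sketched as follows (toy sentences and offsets, not the original data; the optional sanity check of step 4 is included):

```python
import numpy as np

# toy example: 'ACME Corp' lives in the second sentence (index 1)
sentences = ["no entities here", "ACME Corp played on Friday"]
entities = [(17, 26, 'organization')]  # offsets into the '\n'-joined text

# step 2: cumulative sentence lengths = upper bounds of the sentence intervals
cumulative = np.cumsum(list(map(len, sentences)))

picked = set()
for start, end, _ in entities:
    # step 3: first interval whose upper bound exceeds the entity start
    idx = int(np.where(start < cumulative)[0][0])
    # step 4 (sanity check): the end must fit in the same interval;
    # '+ idx' compensates for the newlines that split() removed
    assert end <= cumulative[idx] + idx, "entity crosses a sentence boundary"
    picked.add(idx)

print(sorted(picked))  # → [1]
```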

Have a look at the cumulative_sentence_length variable, which holds the value [ 24 211 240 331 472 553 601], the upper bounds of the sentence intervals.

As you are dealing with a data science topic, I presume that the use of numpy here is no hurdle for you.
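One caveat worth noting (my addition, not part of the original answer): split('\n') drops the newline characters, so the cumulative bounds fall one character short of the raw offsets for every preceding line. For an entity close enough to a sentence boundary, the lookup can then miss entirely. Counting the removed newlines fixes it:

```python
import numpy as np

text = "ab\ncd\nef"
entities = [(6, 8, 'x')]  # 'ef', the third sentence (index 2) in raw offsets
sentences = text.split('\n')

naive = np.cumsum([len(s) for s in sentences])      # [2 4 6]: 6 < bound never holds
fixed = np.cumsum([len(s) + 1 for s in sentences])  # [3 6 9]: counts the removed '\n'

assert len(np.where(entities[0][0] < naive)[0]) == 0  # naive lookup finds no sentence
idx = int(np.where(entities[0][0] < fixed)[0][0])
print(idx)  # → 2
```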
