Training a custom NER model

Question

I have been training a NER model on some text, trying to find cities in it and tag them with custom entity labels.

Example:

    ('paragraph Designated Offices Party A New York Party B Delaware paragraph pricing source calculation Market Value shall generally accepted pricing source reasonably agreed parties paragraph Spot rate Spot Rate specified paragraph reasonably agreed parties',
    {'entities': [(37, 41, 'DesignatedBankLoc'), (54, 62, 'CounterpartyBankLoc')]})

I am looking for two entities here: DesignatedBankLoc and CounterpartyBankLoc. A single text can also contain multiple entities.
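A quick sanity check for annotations in this format is to print the substring that each (start, end) pair covers. The sketch below uses plain Python slicing on a shortened copy of the example text; the helper name `show_spans` is hypothetical and not part of spaCy.

```python
# Shortened copy of the example text; the offsets below only touch this prefix.
text = ("paragraph Designated Offices Party A New York Party B Delaware "
        "paragraph pricing source")
entities = [(37, 41, "DesignatedBankLoc"), (54, 62, "CounterpartyBankLoc")]


def show_spans(text, entities):
    """Return (label, covered substring) for each (start, end, label) triple."""
    return [(label, text[start:end]) for start, end, label in entities]


spans = show_spans(text, entities)
for label, covered in spans:
    # repr() makes stray leading or trailing whitespace visible
    print(f"{label}: {covered!r}")
```

With the offsets as posted, the first span covers `'New '` (with a trailing space) rather than `'New York'`; `(37, 45)` would cover the full name. Spans that do not line up with token boundaries may not be learned as intended, so this is worth checking across all 60 rows.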

Currently I am training on 60 rows of data, as follows:

import spacy
import random
def train_spacy(data,iterations):
    TRAIN_DATA = data
    nlp = spacy.blank('en')  # create blank Language class
    # Create the built-in NER component and add it to the pipeline (spaCy v2 API).
    # nlp.create_pipe works for built-ins that are registered with spaCy.
    if 'ner' not in nlp.pipe_names:
        ner = nlp.create_pipe('ner')
        nlp.add_pipe(ner, last=True)
    else:
        ner = nlp.get_pipe('ner')  # reuse the existing component so `ner` is always defined

    # add labels
    for _, annotations in TRAIN_DATA:
        for ent in annotations.get('entities'):
            # print (ent[2])
            ner.add_label(ent[2])

    # get names of other pipes to disable them during training
    other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'ner']
    with nlp.disable_pipes(*other_pipes):  # only train NER
        optimizer = nlp.begin_training()
        for itn in range(iterations):
            print("Starting iteration " + str(itn))
            random.shuffle(TRAIN_DATA)
            losses = {}
            for text, annotations in TRAIN_DATA:
                nlp.update(
                    [text],  # batch of texts
                    [annotations],  # batch of annotations
                    drop=0.5,  # dropout - make it harder to memorise data
                    sgd=optimizer,  # callable to update weights
                    losses=losses)
            print(losses)
    return nlp


prdnlp = train_spacy(TRAIN_DATA, 100)

My problem is:

The model predicts correctly when the input contains cities that appeared in the training data, whether the surrounding text follows the same pattern or a different one. It predicts no entities at all when the text, in the same or a different pattern, contains cities that never occur in the training data set.

Please suggest why this is happening, and help me understand how the model is actually being trained.

Answer 1

Based on experience: you have 60 rows of data and train for 100 iterations. You are overfitting on the values of the entities (the specific city names) as opposed to their position in the text.

To check this, try injecting the city names at random places in a sentence and see what happens. If the model still tags them, you are likely overfitting.
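The check described above could be sketched roughly as follows. The helper `inject_city` and the probe sentence are illustrative assumptions, not part of spaCy or of the original post; the helper only builds probe texts, and the trained model from the question (`prdnlp`) would then be run on them.

```python
import random


def inject_city(sentence, city, rng):
    """Insert `city` between two randomly chosen words of `sentence`.

    Returns the new text and the (start, end) character offsets of the city.
    """
    words = sentence.split()
    position = rng.randint(0, len(words))  # inclusive, so either end is possible
    text = " ".join(words[:position] + [city] + words[position:])
    # Everything before the city, plus one separating space if anything precedes it.
    start = len(" ".join(words[:position])) + (1 if position > 0 else 0)
    return text, (start, start + len(city))


rng = random.Random(0)  # fixed seed so the probes are reproducible
sentence = "pricing source calculation Market Value shall generally accepted"
trained_cities = ["New York", "Delaware"]  # cities seen during training
probes = [inject_city(sentence, city, rng) for city in trained_cities]

for text, (start, end) in probes:
    print(text, "->", repr(text[start:end]))
    # With the trained model from the question, one would then inspect:
    # print([(ent.text, ent.label_) for ent in prdnlp(text).ents])
```

If the model labels the injected names even though the surrounding words give no hint of a bank location, it has most likely memorised the names themselves.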

There are two solutions:

Create more training data with more varied values for these entities.
Test with different numbers of iterations.
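One way to get more varied entity values, sketched here under assumptions, is to turn an annotated text into a template and fill it with different city names, recomputing the character offsets each time. The `{Label}` slot syntax, the helper names, and the extra city list are all hypothetical; the output uses the same `(text, {'entities': [...]})` format as the question's TRAIN_DATA. Keeping a few cities out of training gives a held-out set for comparing different iteration counts.

```python
import random
import re

SLOT = re.compile(r"\{(\w+)\}")  # hypothetical slot syntax: {EntityLabel}


def fill_template(template, values):
    """Replace each {Label} slot with values[Label] and record its offsets."""
    parts, entities, cursor, length = [], [], 0, 0
    for match in SLOT.finditer(template):
        literal = template[cursor:match.start()]
        parts.append(literal)
        length += len(literal)
        label = match.group(1)
        value = values[label]
        entities.append((length, length + len(value), label))
        parts.append(value)
        length += len(value)
        cursor = match.end()
    parts.append(template[cursor:])
    return "".join(parts), {"entities": entities}


def make_examples(template, city_pool, n, rng):
    """Build n examples, each using two distinct cities from city_pool."""
    examples = []
    for _ in range(n):
        first, second = rng.sample(city_pool, 2)
        values = {"DesignatedBankLoc": first, "CounterpartyBankLoc": second}
        examples.append(fill_template(template, values))
    return examples


template = ("paragraph Designated Offices Party A {DesignatedBankLoc} "
            "Party B {CounterpartyBankLoc} paragraph pricing source")
# Illustrative city list; a real one should be much larger.
cities = ["New York", "Delaware", "London", "Singapore", "Frankfurt", "Tokyo"]

rng = random.Random(0)
rng.shuffle(cities)
train_cities, heldout_cities = cities[:4], cities[4:]  # no city appears in both

train_data = make_examples(template, train_cities, 20, rng)
heldout_data = make_examples(template, heldout_cities, 5, rng)
print(train_data[0])
```

The model can then be trained on `train_data` with several iteration counts, and its predictions on `heldout_data` compared, to see at which point it stops generalising to unseen cities. A single template is used here only to keep the sketch short; varied sentence patterns matter as much as varied city names.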
