Losses in NER training loop not decreasing in spaCy

I am trying to train a new entity type 'HE INST' to recognize colleges. That is the only new label. I have a long document as raw text. I ran NER on it, saved the entities to TRAIN_DATA, and then added the new entity labels to TRAIN_DATA (replacing existing annotations wherever they overlapped).

The loss in the training loop stays roughly constant (~4000 over all 15 texts, ~300 for a single example). Why does this happen, and how do I train the model properly? I have around 18 texts with 40 annotated new entities. Even after all the iterations, the model still doesn't predict the output correctly.

I haven't changed the script much. I just added en_core_web_lg, the new label, and my TRAIN_DATA.

I am trying to tag institutes from resume (CV) data:

This is one of my texts in TRAIN_DATA (sorry for the long text); I have around 18 such texts concatenated to form TRAIN_DATA:

[("To perform better in my work each day. To increase my knowledge. To bring out my best by hardworking and improving my skills. To serve my parents and my family. To contribute my skills to my country. Marital ; Single Status Nationality \xe2\x80\x94: Indian Known . Parr . English, Malayalam, Hindi, Tamil Languages Hobby Playing cricket and football, Listening to music, Movies, Games. Father's ; V.N. Balappan Nair Name Mother's ; Saraswathy B Nair Name Believers Church Caarmel Engineering College R-Perunad Btech Electronics and communication engineering 6.09(Upto S6) 2015 - 2019 Marthoma Senior Secondary School Kozhencherry All India Senior School Certificate Examination 75% 2014 - 2015 Marthoma Senior Secondary School Kozhencherry Secondary School Examination 8.2 2012 - 2013 s@ INTERESTS Electronics, Sports s@ PERSONAL STRENGTHS Hardworking Loyal Good Team Spirit Good in mathematics ees IAA eM LANL NUL e (2 Problem Solving Skills rg DUS \\ TRAININGS completed the Vocational Industrial Training on Long Distance Communication Systems conducted by Southern Telecom Region, Bharat Sanchar Nigam Limited. Completed the internship training in Power Electronics Group(PEG), Tool Room, Fabrication Shop, Transform Winding, Electro Plating, Security And Surveillance Group(SSG), Special Products Group(SPG), Search And Rescue Beacon(SRB), Intelligent Tracking and Communication Project and Technology Development Center of Keltron Equipment Complex, Thiruvananthapuram. PROJECTS Final Year Project: Life Detection Using Quadcopter This project is useful at the time of natural calamities like flood earthquake etc... And can also be used in military applications as this device detects life signals using a PIR sensor and a thermal sensor. The components used in this are: PIR sensor, Thermal sensor, Arduino Nano, BEC, ESC, Quadcopter. Design project: Wireless Power Bank Wireless Power Bank enables us to charge our phone wordlessly. It can charge a device which is kept 10m(maximum) away from the adaptor without any obstacles in between. It uses the IR technology for power transmission. ACHIEVEMENTS & AWARDS Participated in Pecardio Debugging Conducted as a part of NAKSHATRA 2019, The Annual National Level Techno Cultural Fest held at Saingits College of Engineering, kottayam. Volunteered in Alexa One day workshop on Artificial intelligence. Completed a period of two year tenue with a total of 240 hours in the National Service Scheme activities and has attended NSS Annual Special Camp. Participant in Cricket and football at the Annual Sports Meets. 
DECLARATION do here by confirm that the information given in this form is true to the best of my knowledge and belief.", {'entities': [(29, 37, 'DATE'), (210, 223, 'ORG'), (241, 247, 'NORP'), (256, 260, 'PERSON'), (263, 270, 'LANGUAGE'), (272, 281, 'PERSON'), (283, 288, 'PERSON'), (290, 295, 'NORP'), (362, 375, 'EVENT'), (388, 401, 'PERSON'), (402, 420, 'PERSON'), (423, 445, 'PERSON'), (446, 490, 'HE INST'), (563, 574, 'DATE'), (575, 620, 'ORG'), (625, 668, 'ORG'), (669, 672, 'PERCENT'), (673, 684, 'DATE'), (685, 717, 'ORG'), (764, 775, 'DATE'), (779, 800, 'ORG'), (890, 893, 'ORG'), (909, 910, 'CARDINAL'), (963, 997, 'ORG'), (1001, 1036, 'ORG'), (1050, 1073, 'ORG'), (1075, 1103, 'ORG'), (1142, 1169, 'ORG'), (1172, 1181, 'ORG'), (1183, 1199, 'ORG'), (1201, 1218, 'ORG'), (1220, 1235, 'ORG'), (1275, 1301, 'ORG'), (1304, 1332, 'ORG'), (1335, 1355, 'ORG'), (1360, 1415, 'ORG'), (1419, 1444, 'ORG'), (1446, 1464, 'LOC'), (1475, 1494, 'EVENT'), (1797, 1809, 'GPE'), (1811, 1814, 'GPE'), (1816, 1819, 'ORG'), (1821, 1831, 'ORG'), (1849, 1888, 'ORG'), (1969, 1980, 'CARDINAL'), (2050, 2052, 'ORG'), (2088, 2122, 'ORG'), (2126, 2154, 'ORG'), (2168, 2182, 'EVENT'), (2188, 2194, 'DATE'), (2239, 2270, 'HE INST'), (2297, 2302, 'GPE'), (2303, 2310, 'DATE'), (2358, 2369, 'DATE'), (2370, 2378, 'DATE'), (2401, 2410, 'TIME'), (2414, 2441, 'ORG'), (2470, 2493, 'ORG'), (2534, 2557, 'EVENT')]})]
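Before training, it is worth checking that every annotated span aligns with spaCy's token boundaries; in spaCy v2, entity offsets that don't match token boundaries can be skipped during training, which keeps losses from improving. A quick sanity check, sketched here under the assumption of spaCy v2.x (the API the script below uses) and TRAIN_DATA loaded as in the script:

import spacy
from spacy.gold import biluo_tags_from_offsets  # spaCy v2.x location

nlp = spacy.load("en_core_web_lg")
for i, (text, annotations) in enumerate(TRAIN_DATA):
    doc = nlp.make_doc(text)
    # '-' marks tokens inside spans whose offsets don't align to token boundaries
    tags = biluo_tags_from_offsets(doc, annotations["entities"])
    if "-" in tags:
        print("Example %d: %d misaligned tokens" % (i, tags.count("-")))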

The script is given below. (Note: the eval function is used to parse TRAIN_DATA into a list after reading it as a string from a text file. You most probably know that, but just in case; a safer alternative is sketched after the script.)

from __future__ import unicode_literals, print_function

import plac
import random
from pathlib import Path
import spacy
import en_core_web_lg
from spacy.util import minibatch, compounding


# new entity label
LABEL = "HE INST"

with open('train_dump-backup.txt', 'r') as i_file:
    t_data = i_file.read()
TRAIN_DATA = eval(t_data)  # parses the dumped list of (text, annotations) tuples

@plac.annotations(
    model=("en_core_web_lg", "option", "m", str),
    new_model_name=("NLP_INST", "option", "nm", str),
    output_dir=("/home/drbinu/Downloads/NLP_INST", "option", "o", Path),
    n_iter=("30", "option", "n", int),
)

def main(model=None, new_model_name="animal", output_dir=None, n_iter=30):
    """Set up the pipeline and entity recognizer, and train the new entity."""
    random.seed(0)
    if model is not None:
        nlp = spacy.load(model)  # load existing spaCy model
        print("Loaded model '%s'" % model)
    else:
        nlp = spacy.blank("en")  # create blank Language class
        print("Created blank 'en' model")
    # Add entity recognizer to model if it's not in the pipeline
    # nlp.create_pipe works for built-ins that are registered with spaCy
    if "ner" not in nlp.pipe_names:
        ner = nlp.create_pipe("ner")
        nlp.add_pipe(ner)
    # otherwise, get it, so we can add labels to it
    else:
        ner = nlp.get_pipe("ner")

    ner.add_label(LABEL)  # add new entity label to entity recognizer
    # Adding extraneous labels shouldn't mess anything up
    ner.add_label("VEGETABLE")
    if model is None:
        optimizer = nlp.begin_training()
    else:
        optimizer = nlp.resume_training()
    move_names = list(ner.move_names)
    # get names of other pipes to disable them during training
    other_pipes = [pipe for pipe in nlp.pipe_names if pipe != "ner"]
    with nlp.disable_pipes(*other_pipes):  # only train NER
        sizes = compounding(1.0, 4.0, 1.001)
        # batch up the examples using spaCy's minibatch
        for itn in range(n_iter):
            random.shuffle(TRAIN_DATA)
            batches = minibatch(TRAIN_DATA, size=sizes)
            losses = {}
            for batch in batches:
                texts, annotations = zip(*batch)
                nlp.update(texts, annotations, sgd=optimizer, drop=0.35, losses=losses)
            print("Losses", losses)

    # test the trained model
    test_text = "B.Tech from Believers Church Caarmel Engineering College CGPA of 8.9"
    doc = nlp(test_text)
    print("Entities in '%s'" % test_text)
    for ent in doc.ents:
        print(ent.label_, ent.text)

    # save model to output directory
    if output_dir is not None:
        output_dir = Path(output_dir)
        if not output_dir.exists():
            output_dir.mkdir()
        nlp.meta["name"] = new_model_name  # rename model
        nlp.to_disk(output_dir)
        print("Saved model to", output_dir)

        # test the saved model
        print("Loading from", output_dir)
        nlp2 = spacy.load(output_dir)
        # Check the classes have loaded back consistently
        assert nlp2.get_pipe("ner").move_names == move_names
        doc2 = nlp2(test_text)
        for ent in doc2.ents:
            print(ent.label_, ent.text)


if __name__ == "__main__":
    plac.call(main)
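As an aside on the eval note above: since TRAIN_DATA is a plain Python literal (a list of (text, dict) tuples), ast.literal_eval parses the same file without executing arbitrary code. A minimal sketch, assuming the same train_dump-backup.txt:

import ast

# Safer than eval: literal_eval only accepts Python literals
# (strings, numbers, tuples, lists, dicts, booleans, None).
with open('train_dump-backup.txt', 'r') as i_file:
    TRAIN_DATA = ast.literal_eval(i_file.read())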

Losses appear to be increasing because pipeline components increment the loss as part of the update step:

https://github.com/explosion/spaCy/blob/ae4af52ce7dd9dda0eb0f1b8eeb0cba7d20facdf/spacy/pipeline/pipes.pyx#L989

At the start of each epoch, you may want to snapshot the total cumulative loss; at the end of the epoch, you can compute the average loss over the data observed.
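A sketch of what that might look like as a drop-in replacement for the training loop in your script (losses is already re-initialised each iteration, so losses["ner"] at the end of an epoch is the total accumulated over that epoch's batches; n_examples is a counter added here for the average, not part of the original script):

for itn in range(n_iter):
    random.shuffle(TRAIN_DATA)
    batches = minibatch(TRAIN_DATA, size=sizes)
    losses = {}
    n_examples = 0
    for batch in batches:
        texts, annotations = zip(*batch)
        nlp.update(texts, annotations, sgd=optimizer, drop=0.35, losses=losses)
        n_examples += len(texts)
    # losses["ner"] has been accumulated across all batches this epoch;
    # divide by the number of examples seen to get a comparable per-example loss
    print("Epoch %d: avg NER loss %.3f" % (itn, losses["ner"] / n_examples))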
