SpaCy Custom NER Model: Dependency Parser Training Error

I was trying to build a custom NER model using spaCy. After building the model for entities, I needed to train the model's dependency parser. I tried following the sample code provided on the spaCy website: https://spacy.io/usage/training#tagger-parser

The sample training data given on the spaCy website is:

TRAIN_DATA = [
(
    "They trade mortgage-backed securities.",
    {
        "heads": [1, 1, 4, 4, 5, 1, 1],
        "deps": ["nsubj", "ROOT", "compound", "punct", "nmod", "dobj", "punct"],
    },
)]

In this sample code, the training data contains a label called "heads". I am not sure exactly what it is or what its significance is in the code.

I tried to run the model without the "heads" label in the training data. A sample of the training data is:

TRAIN_PARSER = (
    'Mr Manjunath who is in-charge of the motor at their Goa location.',
    {
        'deps': ['compound', 'ROOT', 'nsubj', 'relcl', 'prep', 'punct', 'pobj', 'prep',
                 'det', 'pobj', 'prep', 'poss', 'compound', 'pobj', 'punct'],
    },
)

When I try to run the model without the "heads" label, using the code below:

from __future__ import unicode_literals, print_function

import plac
import random
from pathlib import Path
import spacy
from spacy.util import minibatch, compounding


# training data
TRAIN_DATA = TRAIN_PARSER


@plac.annotations(
    model=("Model name. Defaults to blank 'en' model.", "option", "m", str),
    output_dir=("Optional output directory", "option", "o", Path),
    n_iter=("Number of training iterations", "option", "n", int),
)
def main(model='model1', output_dir='model2', n_iter=74):
    """Load the model, set up the pipeline and train the parser."""
    if model is not None:
        nlp = spacy.load(model)  # load existing spaCy model
        print("Loaded model '%s'" % model)
    else:
        nlp = spacy.blank("en")  # create blank Language class
        print("Created blank 'en' model")

    # add the parser to the pipeline if it doesn't exist
    # nlp.create_pipe works for built-ins that are registered with spaCy
    if "parser" not in nlp.pipe_names:
        parser = nlp.create_pipe("parser")
        nlp.add_pipe(parser, first=True)
    # otherwise, get it, so we can add labels to it
    else:
        parser = nlp.get_pipe("parser")

    # add labels to the parser
    for _, annotations in TRAIN_DATA:
        for dep in annotations.get('deps', []):
            parser.add_label(dep)

    # get names of other pipes to disable them during training
    pipe_exceptions = ["parser", "trf_wordpiecer", "trf_tok2vec"]
    other_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions]
    with nlp.disable_pipes(*other_pipes):  # only train parser
        optimizer = nlp.begin_training()
        for itn in range(n_iter):
            random.shuffle(TRAIN_DATA)
            losses = {}
            # batch up the examples using spaCy's minibatch
            batches = minibatch(TRAIN_DATA, size=compounding(4.0, 32.0, 1.001))
            for batch in batches:
                texts, annotations = zip(*batch)
                nlp.update(texts, annotations, sgd=optimizer, losses=losses)
            print("Losses", losses)

    # test the trained model
    test_text = "I like securities."
    doc = nlp(test_text)
    print("Dependencies", [(t.text, t.dep_, t.head.text) for t in doc])

    # save model to output directory
    if output_dir is not None:
        output_dir = Path(output_dir)
        if not output_dir.exists():
            output_dir.mkdir()
        nlp.to_disk(output_dir)
        print("Saved model to", output_dir)

        # test the saved model
        print("Loading from", output_dir)
        nlp2 = spacy.load(output_dir)
        doc = nlp2(test_text)
        print("Dependencies", [(t.text, t.dep_, t.head.text) for t in doc])


main(model='model1', output_dir='model2', n_iter=74)

I get the following error:

IndexError: list index out of range

Can someone please explain what exactly the problem is here and how I can solve it? Also, how can I generate the "heads" label for my training data?

The heads information is required to identify the immediate 'parent' of each token in the dependency tree. For example, in

"I like London and Berlin.",
        {
            "heads": [1, 1, 1, 2, 2, 1],
            "deps": ["nsubj", "ROOT", "dobj", "cc", "conj", "punct"],
        },

the word I has its head at index 1, i.e. the word like, and is connected to it by the dependency nsubj.
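To make the index relationship concrete, here is a minimal sketch in plain Python (no spaCy required) that pairs each token with its head and dependency label. The token list is an assumption about how the sentence is split into six tokens, and `describe_arcs` is a hypothetical helper written for illustration, not part of spaCy. It also checks that the three lists line up, which is the kind of mismatch you get when "heads" is left out.

```python
def describe_arcs(tokens, heads, deps):
    """Return (token, dep, head_token) triples for one annotated sentence.

    heads[i] is the index of the parent of token i; the root points at itself.
    """
    # Every token needs exactly one head index and one dependency label.
    if not (len(tokens) == len(heads) == len(deps)):
        raise ValueError(
            "tokens, heads and deps must have the same length, got "
            f"{len(tokens)}, {len(heads)} and {len(deps)}"
        )
    arcs = []
    for token, head, dep in zip(tokens, heads, deps):
        # A head index must refer to a token in the same sentence.
        if not 0 <= head < len(tokens):
            raise ValueError(f"head index {head} is out of range for {token!r}")
        arcs.append((token, dep, tokens[head]))
    return arcs


# Assumed tokenization of "I like London and Berlin."
tokens = ["I", "like", "London", "and", "Berlin", "."]
heads = [1, 1, 1, 2, 2, 1]
deps = ["nsubj", "ROOT", "dobj", "cc", "conj", "punct"]

for token, dep, head in describe_arcs(tokens, heads, deps):
    print(f"{token:<7} --{dep}--> {head}")
```

Read this way, "and" and "Berlin" both attach to "London" (index 2), while everything else attaches to the root "like" (index 1), which is its own head.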

More information about this terminology can be found in the spaCy docs: https://spacy.io/usage/linguistic-features#navigating
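As for generating "heads" for your own sentences, one possible approach (a sketch, not something the spaCy training example prescribes) is to run an existing pretrained parser over the text and read off `token.head.i` and `token.dep_`, then correct the result by hand. The model name `en_core_web_sm` is an assumption, and `annotations_from_doc` and `bootstrap_example` are hypothetical helper names.

```python
def annotations_from_doc(doc):
    """Build a {'heads': [...], 'deps': [...]} dict from a parsed Doc.

    Relies only on the Token attributes .head, .i and .dep_.
    """
    return {
        "heads": [token.head.i for token in doc],  # index of each token's parent
        "deps": [token.dep_ for token in doc],     # dependency label per token
    }


def bootstrap_example(text, model_name="en_core_web_sm"):
    """Parse `text` with a pretrained model and return a (text, annotations) pair.

    Assumption: `model_name` is an installed spaCy model that includes a parser.
    """
    import spacy  # imported lazily so annotations_from_doc works without spaCy

    nlp = spacy.load(model_name)
    return (text, annotations_from_doc(nlp(text)))
```

Calling `bootstrap_example("I like London and Berlin.")` should give a pair in the same shape as the TRAIN_DATA entries above, provided the model is installed. Because the heads and labels come from an automatic parser, they will contain mistakes, so treat them as a draft to review rather than as gold annotations.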
