How does this for loop work in spaCy's custom NER training code?

Question

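I am writing code to train custom entities in spaCy's NER engine. I am stuck on a small part of the code from an online tutorial. Here is the link to the tutorial. The code is below; I can't understand the two for loops under the comment # add labels. I am new to Python.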

import spacy
################### Train Spacy NER.###########
def train_spacy():
    TRAIN_DATA = convert_dataturks_to_spacy("dataturks_downloaded.json")
    nlp = spacy.blank('en')  # create blank Language class
    # create the built-in pipeline components and add them to the pipeline
    # nlp.create_pipe works for built-ins that are registered with spaCy
    if 'ner' not in nlp.pipe_names:
        ner = nlp.create_pipe('ner')
        nlp.add_pipe(ner, last=True)

    # add labels
    for _, annotations in TRAIN_DATA:
        for ent in annotations.get('entities'):
            ner.add_label(ent[2])

Clearly, this for loop is adding custom labels to the NER component. My questions are:

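What is annotations, and what is its data type? (I googled "spacy annotation" but couldn't find an answer.)
Why are there two variables on the left side of in (_ and annotations)?
What does ent[2] return? What is at position 2?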

Answer 1

Most of your questions can be answered by looking at the function convert_dataturks_to_spacy. That code is in the same repository as the tutorial you are following.

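The function returns a list of tuples, where each tuple has the form (text, {"entities": entities}). annotations is the second element of each tuple, so it is a dict with a single "entities" key.
Assigning several variables from one value at once is called tuple unpacking. Basically, the for loop says: for each tuple in the training data, assign the first element of the tuple to _, assign the second element to annotations, and then do something. In Python, _ is often used as a throwaway variable, i.e. one that exists in the data but is not used anywhere else in the code.
ent[2] is the label of the tagged entity. Looking at the code, the entities from dataturks are tuples with 3 elements: the start position in the string, the end position in the string, and the label.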
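
To make the three points above concrete, here is a minimal sketch you can run without spaCy. The sample sentences, offsets, and labels are made up for illustration and are not taken from the tutorial's data; only the (text, {"entities": [...]}) shape comes from the description above. A plain list stands in for ner.add_label.

```python
# Hypothetical training data in the (text, {"entities": [...]}) shape.
# Each entity is a (start, end, label) tuple of character offsets plus a label.
TRAIN_DATA = [
    ("Alice works at Acme in Paris",
     {"entities": [(0, 5, "PERSON"), (15, 19, "ORG"), (23, 28, "GPE")]}),
    ("Bob moved to Berlin",
     {"entities": [(0, 3, "PERSON"), (13, 19, "GPE")]}),
]

labels = []  # stand-in for the labels that ner.add_label would receive

# Tuple unpacking: the text goes to the throwaway name _,
# the dict goes to annotations.
for _, annotations in TRAIN_DATA:
    # annotations is a dict; .get("entities") returns its list of tuples.
    for ent in annotations.get("entities"):
        start, end, label = ent  # unpack the 3-element entity tuple
        # ent[2] and label are the same object: the entity's label string.
        if ent[2] not in labels:
            labels.append(ent[2])

print(type(annotations))  # <class 'dict'>
print(labels)             # ['PERSON', 'ORG', 'GPE']
```

In the real training code the inner loop calls ner.add_label(ent[2]) instead of appending to a list, so each distinct label in the data is registered with the NER component.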

Solution 1
1 accepted, 2021-01-07 08:53:14