How does this for loop work in spaCy's custom NER training code?

I am writing code to train custom entities in spaCy's NER engine. I am stuck on a small part of the code from an online tutorial. The code is below; I don't understand the two for loops under the comment # add labels. I am new to Python.

import spacy
################### Train Spacy NER.###########
def train_spacy():
    TRAIN_DATA = convert_dataturks_to_spacy("dataturks_downloaded.json")
    nlp = spacy.blank('en')  # create blank Language class
    # create the built-in pipeline components and add them to the pipeline
    # nlp.create_pipe works for built-ins that are registered with spaCy
    if 'ner' not in nlp.pipe_names:
        ner = nlp.create_pipe('ner')
        nlp.add_pipe(ner, last=True)

    # add labels
    for _, annotations in TRAIN_DATA:
        for ent in annotations.get('entities'):
            ner.add_label(ent[2])

Apparently, this for loop adds custom labels to the NER component. My questions are:

  1. What is 'annotations', and what is its data type? (I googled 'spacy annotation' but couldn't find the answer.)
  2. Why are there two variables to the left of 'in' ('_' and 'annotations')?
  3. What does ent[2] return? What is at position 2?

Your questions can mostly be answered by understanding the function convert_dataturks_to_spacy. The code for it is in the same repo as the tutorial you are following.

  1. The function returns a list of tuples, where each tuple has the form (text, {"entities": entities}). annotations is the second element of each tuple, so it is a dict whose "entities" key holds the list of entities for that text.
  2. Assigning several variables from one value is called tuple unpacking. The for loop says: for each tuple in the training data, assign the first element to _ and the second element to annotations, then run the loop body. In Python, _ is often used as a throw-away variable, i.e. a name for something that exists in your data but isn't used elsewhere in the code.
  3. ent[2] is the label of the entity being tagged. Looking at the code, an entity in dataturks is a tuple with 3 elements: the start position in the string, the end position in the string, and the label.
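
The three points above can be tried out without spaCy. The following is a minimal sketch under stated assumptions: the sample texts, offsets and labels are made up to match the shape described in the answer (they are not real output of convert_dataturks_to_spacy), and a plain list stands in for ner.add_label.

```python
# Hypothetical sample data in the shape described above:
# a list of (text, {"entities": [(start, end, label), ...]}) tuples.
TRAIN_DATA = [
    ("Alice works at Acme in Paris.",
     {"entities": [(0, 5, "PERSON"), (15, 19, "ORG"), (23, 28, "GPE")]}),
    ("Bob moved to Berlin.",
     {"entities": [(0, 3, "PERSON"), (13, 19, "GPE")]}),
]

labels = []  # stand-in for the labels that ner.add_label would receive

# Tuple unpacking: the text goes to the throw-away name _,
# the dict goes to annotations.
for _, annotations in TRAIN_DATA:
    # annotations is a dict; its "entities" value is a list of 3-tuples.
    for ent in annotations.get("entities"):
        start, end, label = ent   # the same values as ent[0], ent[1], ent[2]
        if label not in labels:   # keep each label once, in first-seen order
            labels.append(label)

print(type(annotations).__name__)
print(labels)
```

Here ent[2] and label are the same value, and the start and end offsets index into the text, so for the first sample "Alice works at Acme in Paris."[0:5] is "Alice".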
