How to use a pickle file for spaCy NER model training?

I have data for fine-tuning in this format:

[[(('Kaweah', 'NNP'), 'O'),
  (('Delta', 'NNP'), 'O'),
  (('Mental', 'NNP'), 'O'),
  (('Health', 'NNP'), 'O'),
  (('Hospital', 'NNP'), 'O'),
  (('D/p', 'NNP'), 'O'),
  (('Aph', 'NNP'), 'O'),
  (('is', 'VBZ'), 'O'),
  (('located', 'VBN'), 'O'),
  (('at', 'IN'), 'O'),
  (('1100', 'CD'), 'B-GPE'),
  (('SO', 'NNP'), 'I-GPE'),
  (('.', '.'), 'I-GPE'),
  (('AKERS', 'NNP'), 'I-GPE'),
  (('STREET', 'NNP'), 'I-GPE')],
 [(('CHARLTON', 'NNP'), 'O'),
  (('MEMORIAL', 'NNP'), 'O'),
  (('HOSPITAL', 'NNP'), 'O'),
  (('is', 'VBZ'), 'O'),
  (('located', 'VBN'), 'O'),
  (('at', 'IN'), 'O'),
  (('2449', 'CD'), 'B-GPE'),
  (('THIRD', 'NNP'), 'I-GPE'),
  (('STREET', 'NNP'), 'I-GPE'),
  ((',', ','), 'I-GPE'),
  (('GA', 'NNP'), 'I-GPE')]]

But the spaCy training format looks like this:

TRAIN_DATA =[ ("Pizza is a common fast food.", {"entities": [(0, 5, "FOOD")]}),
              ("Pasta is an italian recipe", {"entities": [(0, 5, "FOOD")]}) ]

What should I do to convert my pickle file to the spaCy format?

You can take the tokens of each sentence and join them with whitespace to get the text input, then calculate the start and end character offsets for the span of each entity label.
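A minimal sketch of that conversion, assuming each sentence is a list of ((token, pos_tag), bio_tag) pairs as in the question and that tokens are joined with a single space. The function name bio_to_spacy and the file name data.pkl are hypothetical, not part of spaCy or the original post.

```python
import pickle  # only needed for the commented-out loading step below


def bio_to_spacy(sentence):
    """Convert one sentence of ((token, pos), bio_tag) pairs into
    (text, {"entities": [(start, end, label), ...]})."""
    pieces = []
    entities = []
    offset = 0
    current = None  # [start, end, label] of the entity being built
    for (token, _pos), tag in sentence:
        if pieces:
            offset += 1  # assumption: exactly one space between tokens
        start, end = offset, offset + len(token)
        pieces.append(token)
        offset = end
        label = None if tag == "O" else tag[2:]
        if tag.startswith("I-") and current is not None and current[2] == label:
            current[1] = end  # extend the open entity
        else:
            if current is not None:
                entities.append(tuple(current))  # close the previous entity
                current = None
            if label is not None:
                current = [start, end, label]  # B- tag (or stray I- tag) opens one
    if current is not None:
        entities.append(tuple(current))
    return " ".join(pieces), {"entities": entities}


# Hypothetical loading step; replace the path with your own pickle file.
# with open("data.pkl", "rb") as f:
#     sentences = pickle.load(f)
sentences = [
    [(("CHARLTON", "NNP"), "O"), (("MEMORIAL", "NNP"), "O"),
     (("HOSPITAL", "NNP"), "O"), (("is", "VBZ"), "O"),
     (("located", "VBN"), "O"), (("at", "IN"), "O"),
     (("2449", "CD"), "B-GPE"), (("THIRD", "NNP"), "I-GPE"),
     (("STREET", "NNP"), "I-GPE"), ((",", ","), "I-GPE"),
     (("GA", "NNP"), "I-GPE")],
]

TRAIN_DATA = [bio_to_spacy(sentence) for sentence in sentences]
print(TRAIN_DATA[0])
```

The single-space join is the weak point of this sketch: punctuation such as "," ends up preceded by a space ("STREET , GA"), which may not match the spacing of the original text.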

But you will be creating artificial training data, since you don't know what whitespace was used around each token. That information is lost in CoNLL format.

Therefore, the NER model you train will not be robust: it will learn to always expect the whitespace concatenation of your choice, and may not label tokens correctly in text that is spaced differently.

Basically, it can't be done faithfully: the CoNLL format has lost information that can't be recovered.
