
Spacy train ner using multiprocessing

I am trying to train a custom NER model using spaCy. Currently, I have more than 2k records for training; each text consists of more than 100 words, with at least 2 entities per record. I am running it for 50 iterations, and it takes more than 2 hours to train completely.

Is there any way to train using multiprocessing? Would it improve the training time?

Short answer... probably not

It's very unlikely that you will be able to get this to work, for a few reasons:

  • The network being trained is performing iterative optimization
    • Without knowing the results from the previous batch, the next batch cannot be optimized
  • There is only a single network
    • Any parallel training would create divergent networks...
    • ...which you would then somehow have to merge

Long answer... there's plenty you can do!

There are a few different things you can try, however:

  • Get GPU training working if you haven't
    • It's a pain to set up, but it can speed up training time a bit
    • It will dramatically lower CPU usage, however
  • Try to use the spaCy command line tools
    • The JSON format is a pain to produce, but...
    • The benefit is you get a well-optimised algorithm written by the experts
    • It can give dramatically faster / better results than hand-crafted methods
  • If you have different entities, you can train multiple specialised networks
    • Each of these may train faster
    • These networks could be trained in parallel with each other (CPU permitting)
  • Optimise your Python and experiment with parameters
    • Speed and quality are very dependent on parameter tweaking (batch size, repetitions, etc.)
    • Make sure your Python implementation providing the batches is top notch
  • Pre-process your examples
    • spaCy NER extraction requires a surprisingly small amount of context to work
    • You could try pre-processing your snippets to contain 10 or 15 surrounding words and see how your time and accuracy fare
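As a concrete illustration of the last bullet, here is a minimal, hypothetical helper (not part of spaCy) that trims a training example down to a fixed number of words around its entity spans. It assumes the common (text, entities) training-data shape, where entities are (start_char, end_char, label) tuples:

```python
import re

def trim_context(text, entities, window=15):
    """Keep only `window` words on each side of the entity spans.

    `entities` is a list of (start_char, end_char, label) tuples, as in
    common spaCy training-data formats. Returns a (text, entities) pair
    with character offsets re-mapped to the shortened text.
    """
    if not entities:
        return text, entities
    # Character offsets of every whitespace-separated word.
    words = [m.span() for m in re.finditer(r"\S+", text)]
    first_ent = min(start for start, _, _ in entities)
    last_ent = max(end for _, end, _ in entities)
    # Indices of the first and last words overlapping the entity region.
    first_w = next(i for i, (s, e) in enumerate(words) if e > first_ent)
    last_w = max(i for i, (s, e) in enumerate(words) if s < last_ent)
    # Expand by `window` words in each direction, clamped to the text.
    lo = words[max(first_w - window, 0)][0]
    hi = words[min(last_w + window, len(words) - 1)][1]
    new_entities = [(s - lo, e - lo, label) for s, e, label in entities]
    return text[lo:hi], new_entities
```

For example, `trim_context("a b c ENTITY d e f", [(6, 12, "X")], window=1)` keeps one word of context on each side, returning `("c ENTITY d", [(2, 8, "X")])`.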

Final thoughts... when is your network "done"?

I have trained networks with many entities on thousands of examples, for longer than you specify, and the long and short of it is: sometimes it just takes time.

However, 90% of the increase in performance is captured in the first 10% of training.

  • Do you need to wait for 50 batches?
  • ...or are you looking for a specific level of performance?

If you monitor the quality every X batches, you can bail out when you hit a pre-defined level of quality.
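That bail-out check might look like the following sketch. The function name, target score, and patience window are illustrative assumptions, not spaCy API; `scores` would be the evaluation scores (e.g. NER F-score) you collect every X batches:

```python
def should_stop(scores, target=0.85, patience=3):
    """Decide whether to stop training early.

    Stop once the latest score reaches `target`, or once the score has
    failed to improve over the last `patience` evaluations.
    """
    if not scores:
        return False
    if scores[-1] >= target:
        return True  # hit the pre-defined quality level
    if len(scores) > patience:
        best_before = max(scores[:-patience])
        # No recent evaluation beat the earlier best: training has plateaued.
        return max(scores[-patience:]) <= best_before
    return False
```

You would call this inside the training loop after each evaluation and `break` out when it returns `True`, instead of always running all 50 iterations.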

You can also keep old networks you have trained on previous batches and then "top them up" with new training, reaching a level of performance you couldn't get by starting from scratch in the same time.

Good luck!

Hi, I did a similar project where I created a custom NER model using spaCy 3 and extracted 26 entities from large data. It really depends on how you are passing your data. Follow the steps I mention below and it might work on CPU:

  1. Annotate your text files and save the annotations as JSON.

  2. Convert your JSON files into .spacy format, because this is the format spaCy accepts.

  3. Now, the point to note is how you pass and serialize your .spacy format into the spaCy Doc object.
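Step 2 can be sketched with spaCy 3's DocBin, which is the standard way to produce a .spacy file. The (text, entities) record shape below is an assumption about how your annotation JSON is structured:

```python
import spacy
from spacy.tokens import DocBin

def records_to_spacy(records, out_path="train.spacy", lang="en"):
    """Serialize annotated records to spaCy's binary .spacy format.

    `records` is assumed to be a list of (text, entities) pairs, where
    `entities` is a list of (start_char, end_char, label) tuples loaded
    from your annotation JSON.
    """
    nlp = spacy.blank(lang)  # tokenizer only; no trained pipeline needed
    db = DocBin()
    for text, entities in records:
        doc = nlp.make_doc(text)
        spans = []
        for start, end, label in entities:
            span = doc.char_span(start, end, label=label)
            if span is not None:  # skip spans that don't align to token boundaries
                spans.append(span)
        doc.ents = spans
        db.add(doc)
    db.to_disk(out_path)
```

The resulting file is what you point the training config at (e.g. as `paths.train` when using `spacy train`).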

Passing all your JSON text at once will take more time in training, so split your data and pass it iteratively. Don't pass consolidated data. Split it.
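A minimal splitting helper might look like this; the batch size is an assumption you would tune for your data and memory budget:

```python
def split_batches(examples, batch_size=200):
    """Yield successive fixed-size chunks of the training examples,
    so each pass serializes and trains on a small slice rather than
    the whole consolidated dataset at once."""
    for i in range(0, len(examples), batch_size):
        yield examples[i:i + batch_size]
```

Each yielded chunk can then be converted and passed to training on its own, keeping memory usage flat as the dataset grows.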
