Spacy train ner using multiprocessing
I am trying to train a custom NER model using spaCy. Currently I have more than 2k records for training; each text consists of more than 100 words and each record has at least two entities. I am running it for 50 iterations, and it takes more than 2 hours to train completely.
Is there any way to train using multiprocessing? Will it improve the training time?
It's very unlikely that you will be able to get this to work, for a few reasons:
There are a few different things you can try, however:
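One thing worth knowing before reaching for multiprocessing: spaCy's built-in multiprocessing support targets inference, not the training loop. `nlp.pipe()` accepts an `n_process` argument that fans prediction out across worker processes, but `nlp.update()` stays single-process. A minimal sketch:

```python
import spacy

# spaCy parallelizes *inference* via nlp.pipe(n_process=...);
# the training loop itself remains single-process.
nlp = spacy.blank("en")
texts = ["First document.", "Second document.", "Third document."]
docs = list(nlp.pipe(texts, n_process=2, batch_size=2))
print(len(docs))  # one Doc per input text
```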
I have trained networks with many entities on thousands of examples, for longer than you describe, and the long and short of it is: sometimes it just takes time.
However, 90% of the increase in performance is captured in the first 10% of training.
If you monitor the quality every X batches, you can bail out when you hit a pre-defined level of quality.
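A framework-agnostic sketch of that bail-out idea; here `train_on_batch` and `evaluate` are hypothetical stand-ins for your real `nlp.update()` call and spaCy scorer:

```python
# Check quality every `check_every` batches and bail out once it
# clears a threshold, instead of always running all iterations.
def train_with_early_stop(batches, train_on_batch, evaluate,
                          check_every=50, target_score=0.85):
    score = 0.0
    for i, batch in enumerate(batches, start=1):
        train_on_batch(batch)
        if i % check_every == 0:
            score = evaluate()
            if score >= target_score:
                break  # good enough: stop training early
    return i, score

# Toy driver: the "model" gets better the more batches it has seen.
seen = {"n": 0}
steps, final = train_with_early_stop(
    batches=[None] * 500,
    train_on_batch=lambda b: seen.__setitem__("n", seen["n"] + 1),
    evaluate=lambda: min(1.0, seen["n"] / 200),
)
print(steps, final)  # stops well before all 500 batches
```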
You can also keep old networks you have trained on previous batches and then "top them up" with new training, reaching a level of performance you couldn't achieve by starting from scratch in the same time.
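A minimal sketch of that "top up" approach with the spaCy v3 API. The label and training example are made up for illustration; for a genuinely pre-trained pipeline you would `spacy.load()` it and call `nlp.resume_training()` instead of `nlp.initialize()`:

```python
import spacy
from spacy.training import Example

# Start from a blank pipeline here for a self-contained example;
# in practice you would spacy.load("path/to/old_model") instead.
nlp = spacy.blank("en")
ner = nlp.add_pipe("ner")
ner.add_label("ORG")

train_data = [("Apple opened an office", {"entities": [(0, 5, "ORG")]})]
optimizer = nlp.initialize()  # loaded model: nlp.resume_training()
for _ in range(5):
    for text, ann in train_data:
        example = Example.from_dict(nlp.make_doc(text), ann)
        nlp.update([example], sgd=optimizer)
nlp.to_disk("topped_up_model")  # save for the next "top up" round
```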
Good luck!
Hi, I did the same kind of project: I created a custom NER model using spaCy 3 and extracted 26 entities from a large dataset. It really depends on how you are passing your data. Follow the steps below; it might work on CPU:
Annotate your text files and save the annotations as JSON.
Convert your JSON files into the .spacy format, because this is the format spaCy accepts.
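A minimal sketch of that conversion using `spacy.tokens.DocBin`. The annotation layout here, a list of `(start, end, label)` character offsets per record, is an assumption about how your JSON looks:

```python
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")
db = DocBin()

# Hypothetical annotation format (adapt to your own JSON layout):
data = [
    {"text": "Apple is in Cupertino",
     "entities": [[0, 5, "ORG"], [12, 21, "GPE"]]},
]
for record in data:
    doc = nlp.make_doc(record["text"])
    ents = []
    for start, end, label in record["entities"]:
        span = doc.char_span(start, end, label=label)
        if span is not None:  # skip annotations that don't align to tokens
            ents.append(span)
    doc.ents = ents
    db.add(doc)
db.to_disk("train.spacy")  # the file you pass to spacy train
```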
Now, the point to note here is how you pass and serialize your .spacy format into the spaCy Doc object.
Passing all your JSON text at once will take more time in training, so split your data and pass it iteratively. Don't pass the consolidated data; split it.
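The splitting step above can be sketched as follows; the chunk size of 500 is arbitrary and the integer records are a stand-in for your annotated examples:

```python
# Split training data into fixed-size chunks and feed them one at a
# time instead of passing one consolidated list.
def chunked(data, size):
    for i in range(0, len(data), size):
        yield data[i:i + size]

records = list(range(2000))       # stand-in for ~2k annotated records
chunks = list(chunked(records, 500))
print(len(chunks))  # number of training chunks
```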