
How do real-world machine learning production systems run?

Original post: 2018-06-22 05:53:54 · Tags: python, machine-learning, deployment, data-science, production-environment

Dear Machine Learning/AI community,

I am a budding, aspiring machine learning practitioner who has worked on open online data sets and on some proofs of concept (POCs) built locally for my project. I have built some models and saved them as pickle objects to avoid re-training.

One question always puzzles me: how does a real production system work for ML algorithms?

Say I have trained my ML algorithm on some millions of data points and I want to move it to a production system or host it on a server. In the real world, are models converted into pickle objects? If so, the pickled file would be huge, wouldn't it? The models I trained locally on just 50,000 rows of data already took 300 MB of disk space as a pickled object. I don't think this is the right approach.

So how does it work, so that my ML algorithm does not have to re-train before it starts predicting on incoming data? And how do we actually turn an ML algorithm into a continuous online learner? For example, say I build an image classifier and start predicting on incoming images. I then want to train the algorithm again by adding the incoming online images to my previously trained data set. Maybe not for every data point, but once a day I want to combine all the data received that day and re-train with, say, 100 new images that my previously trained classifier predicted, together with their actual values. This re-training should not stop my previously trained algorithm from predicting on incoming data, since it may take a long time depending on the computational resources and the amount of data.

I have Googled and read many articles, but couldn't find or understand an answer to the question above, and it puzzles me every day. Is manual intervention needed for production systems as well, or is there an automated approach?

Any leads or answers to the questions above would be very helpful and much appreciated. Please let me know if my questions don't make sense or are not understandable.

I am not looking for a project-specific answer, just a generic example of how real-world production ML systems work.

Thank you in advance!

1 Answer

Note that this is very broadly formulated, and your question should probably be put on hold, but I will try to give a brief summary of what you are trying to ask:

"How does a real production system work?"
Well, it always depends on the scale of your product and on how you are using ML/AI in your system. For the most part, you would deploy a model on your server or in your app.
Note that the size of a deployed model does NOT scale linearly with the amount of training data you have. Rather, the size of your network is determined by its architecture, that is, by the number of parameters (weights) it contains. Note also that after training you might not even need as much memory: during training a network such as a CNN has to keep intermediate activations around for backpropagation, whereas at inference time it only needs its comparatively small set of weights. I can highly recommend Roger Grosse's slides on the size of a network. This also directly relates to the second point.
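To make the point about model size concrete, here is a minimal sketch using only the Python standard library. The toy linear model and the helper names (`train_linear_model`, `make_rows`) are made up for illustration and are not from any particular framework. It trains the same 4-feature model on 100 rows and on 10,000 rows and compares the pickled sizes. Keep in mind that this only holds for models with a fixed set of parameters; some model families (nearest-neighbor models, for example) store their training data and do grow with it.

```python
import pickle
import random


def train_linear_model(rows, n_features, epochs=5, lr=0.01):
    """Fit y ~ w.x + b with plain stochastic gradient descent (toy example)."""
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        for x, y in rows:
            pred = sum(w * xi for w, xi in zip(weights, x)) + bias
            err = pred - y
            weights = [w - lr * err * xi for w, xi in zip(weights, x)]
            bias -= lr * err
    # The "model" is just its parameters: n_features weights plus one bias.
    return {"weights": weights, "bias": bias}


def make_rows(n_rows, n_features, seed=0):
    """Generate synthetic (features, target) rows; purely illustrative data."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n_rows):
        x = [rng.uniform(-1, 1) for _ in range(n_features)]
        y = 2.0 * x[0] - 1.0 * x[1] + 0.5
        rows.append((x, y))
    return rows


small = train_linear_model(make_rows(100, 4), 4)
large = train_linear_model(make_rows(10_000, 4), 4)

# Both pickles hold 4 weights and 1 bias, however many rows were seen.
small_size = len(pickle.dumps(small))
large_size = len(pickle.dumps(large))
print(small_size, large_size)
```

The two printed sizes are identical, because what gets serialized is the parameter set and not the data it was fitted on.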
"How to avoid re-training?"
As far as I am aware, most systems are not retrained on a regular basis, at least at smaller scale. This means that a network mostly runs in inference mode only, which brings the benefits mentioned above regarding the size of the network (and the time it takes to compute a result). Then again, this depends heavily on the specific task for which you are deploying an ML model. Image classification on "standard categories" has the benefit that quite substantial pretrained models are already available (AlexNet, Inception, ResNet, ...), whereas a model for machine translation mostly depends on your specific domain and vocabulary.
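As a rough illustration of "train once, then run in inference mode only", the following sketch saves a trained model with `pickle` and loads it back for prediction without touching the training code again. The toy threshold model and the `train`/`predict` helpers are hypothetical stand-ins for a real model; the file location is a temporary directory chosen for the example.

```python
import pickle
import tempfile
from pathlib import Path


def train(xs, ys):
    """Toy trainer: put the decision threshold midway between the class means."""
    pos = [x for x, y in zip(xs, ys) if y == 1]
    neg = [x for x, y in zip(xs, ys) if y == 0]
    return {"threshold": (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2}


def predict(model, x):
    """Inference only: no training data or training code is needed here."""
    return 1 if x > model["threshold"] else 0


# Training time (done once, offline).
model = train([1.0, 2.0, 8.0, 9.0], [0, 0, 1, 1])
model_path = Path(tempfile.mkdtemp()) / "model.pkl"
with open(model_path, "wb") as f:
    pickle.dump(model, f)

# Serving time (every start of the server): load and predict, no re-training.
with open(model_path, "rb") as f:
    served_model = pickle.load(f)

predictions = [predict(served_model, x) for x in [0.5, 7.0]]
print(predictions)
```

Only unpickle files you trust, since loading a pickle can execute arbitrary code. Frameworks usually ship their own export formats for deployment, but the load-then-predict pattern stays the same.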
"How would I go about re-training?"
This is actually the tricky part, and there is a significant field called "bandit learning" behind it. The problem is that most of your incoming "new" data will be unlabeled, i.e. it cannot be integrated directly into a new training phase. Instead, you rely on feedback from users to give you a sense of what was right or wrong. Then again, not every user rates the same machine translation (or the same recommendation on Amazon, etc.) the same way, so judging whether your system is "right" or "wrong" becomes very hard.
There are, of course, quite a few methods for automating labeling (e.g. nearest-neighbor search for images, or other similarity-based searches). Online learning therefore only works if you have this continuous loop of feedback and retraining.
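The feedback/retraining loop can be sketched roughly as follows. This is a minimal single-process illustration, not how any particular production system does it: the `ModelServer` class, its method names, and the toy threshold model are all assumptions made for the example. Labeled feedback is collected while the current model keeps serving; retraining runs in a background thread on the old data plus the new feedback, and the new model replaces the old one only once training has finished.

```python
import threading


def train(examples):
    """Toy trainer: put the decision threshold midway between the class means."""
    pos = [x for x, y in examples if y == 1]
    neg = [x for x, y in examples if y == 0]
    return {"threshold": (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2}


class ModelServer:
    """Hypothetical server that keeps predicting while a retrain is running."""

    def __init__(self, training_data):
        self._training_data = list(training_data)
        self._pending = []  # labeled feedback collected since the last retrain
        self._lock = threading.Lock()
        self._model = train(self._training_data)

    def predict(self, x):
        model = self._model  # grab the current model; never waits for training
        return 1 if x > model["threshold"] else 0

    def add_feedback(self, x, true_label):
        with self._lock:
            self._pending.append((x, true_label))

    def retrain(self):
        # Merge new feedback into the training set, then train on a snapshot.
        with self._lock:
            self._training_data.extend(self._pending)
            self._pending = []
            snapshot = list(self._training_data)
        new_model = train(snapshot)  # potentially slow; old model still serves
        self._model = new_model      # swap in the new model in one assignment


server = ModelServer([(1.0, 0), (2.0, 0), (8.0, 1), (9.0, 1)])
before = server.predict(4.5)

# Users report that inputs around 4 to 5 should have been class 1.
server.add_feedback(4.0, 1)
server.add_feedback(5.0, 1)

worker = threading.Thread(target=server.retrain)  # e.g. triggered once a day
worker.start()
during = server.predict(4.5)  # answered by whichever model is current
worker.join()
after = server.predict(4.5)
print(before, after)
```

In a larger setup the retraining would more likely be a separate scheduled job that writes a new model file which the server then loads, but the idea is the same: the serving path never waits for training.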

For larger-scale systems, it also becomes important to scale your models so that they can perform the desired number of predictions/classifications per second. This is also mentioned in the link to the TensorFlow deployment page I provided, and it mainly builds on top of cloud/distributed architectures such as Hadoop or (more recently) Kubernetes. Then again, for smaller products this is mostly overkill, but it serves the purpose of delivering enough resources at any arbitrary scale (and possibly on demand).

As for the integration cycle of machine learning models, there is a nice overview in this article. I want to conclude by stressing that this is a heavily opinionated question, so every answer might be different!