
Send and load an ML model over Apache Kafka

I've been looking around here and on the Internet, but it seems that I'm the first one having this question.

I'd like to train an ML model (let's say something with PyTorch) and write it to an Apache Kafka cluster. On the other side, there should be the possibility of loading the model again from the received array of bytes. It seems that almost all the frameworks only offer methods to load from a path, that is, from a file.
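For context, this is roughly the producer side I have in mind (a minimal sketch assuming kafka-python; the topic name, broker address, and tiny nn.Linear model are just placeholders):

```python
import io
import torch
import torch.nn as nn
from kafka import KafkaProducer

model = nn.Linear(10, 1)  # stand-in for a trained model

# torch.save accepts any file-like object, so the model never touches disk.
buffer = io.BytesIO()
torch.save(model.state_dict(), buffer)

producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("models", buffer.getvalue())
producer.flush()
```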

The only constraint I'm trying to satisfy is to not save the model as a file, so I won't need any storage.

Am I missing something? Do you have any idea how to solve it?

One reason to avoid this is that Kafka messages have a default maximum size of 1 MB. Sending models around in topics therefore wouldn't be the best idea; instead, you could store the model files in a shared filesystem and send URIs to those files (as strings), which the consumer clients then download.
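A rough sketch of that URI approach, assuming kafka-python; the topic name, broker address, and shared-storage URI are hypothetical examples:

```python
from kafka import KafkaProducer, KafkaConsumer

producer = KafkaProducer(bootstrap_servers="localhost:9092")
# The model file itself lives on shared storage (NFS, S3, HDFS, ...);
# only its location travels through Kafka.
producer.send("model-uris", "s3://my-bucket/models/model-v3.pt".encode("utf-8"))
producer.flush()

consumer = KafkaConsumer("model-uris", bootstrap_servers="localhost:9092")
for record in consumer:
    uri = record.value.decode("utf-8")
    # Fetch the file from the shared filesystem here, e.g. with boto3 for S3
    # or a plain open() on an NFS mount, then load the model from it.
    print("New model available at", uri)
```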

For small model files, there is nothing preventing you from dumping the Kafka record bytes to a local file, but if you happen to change the model input parameters, then you'd need to edit the consumer code anyway.
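A minimal consumer-side sketch of that, assuming the producer put torch.save output into the record value; the topic, broker address, and tiny nn.Linear architecture are placeholders:

```python
import io
import torch
import torch.nn as nn
from kafka import KafkaConsumer

consumer = KafkaConsumer("models", bootstrap_servers="localhost:9092")
for record in consumer:
    # Dump the received bytes to a local file and load from that path...
    with open("/tmp/model.pt", "wb") as f:
        f.write(record.value)
    state_dict = torch.load("/tmp/model.pt")

    # ...or skip the file entirely: torch.load also accepts a file-like object.
    # state_dict = torch.load(io.BytesIO(record.value))

    model = nn.Linear(10, 1)  # must match the architecture the producer used
    model.load_state_dict(state_dict)
    model.eval()
```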

Or you can embed the models in other stream processing engines (still on local filesystems), as linked in the comments.
