
Docker design: exchange data between containers or put multiple processes in one container?

In a current project I have to perform the following tasks (among others):

  • capture video frames from five IP cameras and stitch a panorama
  • run machine learning based object detection on the panorama
  • stream the panorama so it can be displayed in a UI

Currently, the stitching and the streaming run in one Docker container, and the object detection runs in another, reading the panorama stream as input.

Since I need to increase the input resolution for the object detector while maintaining the stream resolution for the UI, I have to look for alternative ways of getting the stitched (full-resolution) panorama (~10 MB per frame) from the stitcher container to the detector container.

My thoughts regarding potential solutions:

  • Shared volume. Potential downside: one extra write and read per frame might be too slow?
  • Using a message queue or e.g. Redis (see the sketch after this list). Potential downside: yet another component in the architecture.
  • Merging the two containers. Potential downside(s): not only does it not feel right, but the two containers have completely different base images and dependencies. Plus I'd have to worry about parallelization.
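
For the queue option, a minimal sketch of what the frame hand-off could look like with Redis and its Python client; the "frames" list name and the "redis" hostname are hypothetical (e.g. a Redis container on the same Docker network):

```python
import redis

# Assumed connection details: a Redis container reachable as "redis" on the
# shared Docker network. The "frames" queue name is hypothetical.
r = redis.Redis(host="redis", port=6379)

# Stitcher side: push the encoded panorama (e.g. JPEG bytes).
def publish_frame(frame_bytes: bytes) -> None:
    r.rpush("frames", frame_bytes)
    # Keep only the 10 newest frames so a slow detector
    # doesn't make Redis memory grow without bound.
    r.ltrim("frames", -10, -1)

# Detector side: block until a frame is available, then pop it.
def consume_frame() -> bytes:
    _key, frame_bytes = r.blpop("frames")
    return frame_bytes
```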

Since I'm not the sharpest knife in the docker drawer, what I'm asking for are tips, experiences and best practices regarding fast data exchange between docker containers.

Usually most communication between Docker containers is over network sockets. This is fine when you're talking to something like a relational database or an HTTP server. It sounds like your application is a little more about sharing files, though, and that's something Docker is a little less good at.

If you only want one copy of each component, or are still actively developing the pipeline: I'd probably not use Docker for this. Since each container has an isolated filesystem and its own user ID space, sharing files can be unexpectedly tricky (every container must agree on numeric user IDs). But if you just run everything on the host, as the same user, pointing at the same directory, this isn't a problem.

If you're trying to scale this in production: I'd add some sort of shared filesystem and a message queueing system like RabbitMQ. For local work this could be a Docker named volume or bind-mounted host directory; cloud storage like Amazon S3 will work fine too. The setup is like this:

  • Each component knows about the shared storage and connects to RabbitMQ, but is unaware of the other components.
  • Each component reads a message from a RabbitMQ queue that names a file to process.
  • The component reads the file and does its work.
  • When it finishes, the component writes the result file back to the shared storage, and writes its location to a RabbitMQ exchange (a worker sketch follows this list).
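
A minimal sketch of that worker loop using the pika RabbitMQ client; the "panoramas" queue, the "results" exchange, and the detect_objects() helper are all hypothetical stand-ins, and the paths in the messages are assumed to live on the shared storage:

```python
import json
import pika

def detect_objects(image_bytes):
    """Hypothetical stand-in for the actual ML detection step."""
    return []

# Assumed: a RabbitMQ container reachable as "rabbitmq".
connection = pika.BlockingConnection(pika.ConnectionParameters(host="rabbitmq"))
channel = connection.channel()
channel.queue_declare(queue="panoramas", durable=True)
channel.exchange_declare(exchange="results", exchange_type="fanout")

def on_message(ch, method, properties, body):
    msg = json.loads(body)                      # the message names a file to process
    with open(msg["path"], "rb") as f:          # read the panorama from shared storage
        detections = detect_objects(f.read())
    out_path = msg["path"] + ".detections.json"
    with open(out_path, "w") as f:              # write the result back to shared storage
        json.dump(detections, f)
    ch.basic_publish(exchange="results", routing_key="",
                     body=json.dumps({"path": out_path}))
    ch.basic_ack(delivery_tag=method.delivery_tag)  # ack only after full processing

channel.basic_qos(prefetch_count=1)  # at most one in-flight frame per worker copy
channel.basic_consume(queue="panoramas", on_message_callback=on_message)
channel.start_consuming()
```

Because the worker only acknowledges after the result is written, a crashed copy just causes RabbitMQ to redeliver the message to another copy.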

In this setup each component is totally stateless. If you discover that, for example, the machine-learning component of this is slowest, you can run duplicate copies of it. If something breaks, RabbitMQ will remember that a given message hasn't been fully processed (acknowledged); and again because of the isolation you can run that specific component locally to reproduce and fix the issue.

This model also translates well to larger-scale Docker-based cluster-computing systems like Kubernetes.

Running this locally, I would absolutely keep separate concerns in separate containers (especially if individual image-processing and ML tasks are expensive). The setup I propose needs both a message queue (to keep track of the work) and a shared filesystem (because message queues tend to not be optimized for 10+ MB individual messages). You get a choice between Docker named volumes and host bind-mounts as readily available shared storage. Bind mounts are easier to inspect and administer, but on some platforms are legendarily slow. Named volumes I think are reasonably fast, but you can only access them from Docker containers, which means needing to launch more containers to do basic things like backup and pruning.
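
If you do go the shared-storage route, one detail worth sketching is how to keep a reader from seeing half-written frames. A minimal sketch, assuming both containers mount the same volume at a hypothetical /shared path: write to a temporary name, then atomically rename.

```python
import os
import pathlib
from typing import Optional

SHARED = pathlib.Path("/shared")  # assumed mount point of the shared volume

def write_frame(frame_id: int, frame_bytes: bytes) -> None:
    """Write via a temp file + rename so readers never see a partial frame."""
    tmp = SHARED / f".{frame_id:010d}.tmp"
    final = SHARED / f"{frame_id:010d}.jpg"  # zero-padded so names sort in order
    tmp.write_bytes(frame_bytes)
    os.rename(tmp, final)  # atomic when tmp and final are on the same filesystem

def read_latest_frame() -> Optional[bytes]:
    frames = sorted(SHARED.glob("*.jpg"))
    return frames[-1].read_bytes() if frames else None
```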

Alright, let's unpack this:

  • IMHO a shared volume works just fine, but gets way too messy over time, especially if you're handling stateful services.
  • MQ: this seems like the best option in my opinion. Yes, it's another component in your architecture, but it makes sense to have it rather than maintaining messy shared volumes or handling a massive container image (if you manage to combine the two container images).
  • Yes, you could potentially do this, but it's not a good idea. Considering your use case, I'm going to go ahead and assume that you have a massive list of dependencies which could potentially lead to conflicts. Also, lots of dependencies = larger image = larger attack surface, which from a security perspective is not a good thing.

If you really want to run multiple processes in one container, it's possible. There are multiple ways to achieve that, however I prefer supervisord.
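
For illustration, a minimal supervisord.conf sketch; the program names and script paths are hypothetical, and the image's CMD would then run something like supervisord -c /etc/supervisord.conf:

```ini
[supervisord]
; keep supervisord in the foreground as the container's PID 1
nodaemon=true

[program:stitcher]
; hypothetical entrypoint script
command=python /app/stitcher.py
autorestart=true

[program:detector]
; hypothetical entrypoint script
command=python /app/detector.py
autorestart=true
```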

https://docs.docker.com/config/containers/multi-service_container/
