
Would same file from various docker images be page-cached in k8s node just once?

Excerpt from https://docs.docker.com/storage/storagedriver/overlayfs-driver/:

Page Caching. OverlayFS supports page cache sharing. Multiple containers accessing the same file share a single page cache entry for that file.

This is in the context of a layered docker image, where multiple containers access the same image. In my deployment, however, I can see a very stable (over time) difference in page cache utilization by the same image running on different but similarly configured nodes of a Kubernetes cluster. Neither node has cache pressure that could lead to different reclaim rates.

So, I was wondering whether "the same file" in the above excerpt could refer to "sameness" verified by a hash, so that the file could actually be part of various docker images?

The difference in question is on the order of 30-60MB, which is consistent with the python/libgc libraries my container uses. It would all make sense if common shared libraries were "deduplicated" in the page cache node-wide at the overlayfs level. The page cache would be charged to a cgroup on a first-touch basis as per para 2.3 of https://www.kernel.org/doc/Documentation/cgroup-v1/memory.txt
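One rough way to compare who got charged is to read the cache counters from each container's memory cgroup on both nodes. This is only a sketch assuming cgroup v1 and the kubelet's cgroupfs driver; the pod UID and container id below are hypothetical placeholders, and the kubepods path looks different with the systemd cgroup driver:

# grep -E '^(total_)?cache ' /sys/fs/cgroup/memory/kubepods/burstable/pod<POD_UID>/<CONTAINER_ID>/memory.stat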

Therefore, the image running on a node where the same python libraries were already touched by other docker image(s) would show lower utilization of the page cache compared with a node where those libraries are used by my container only.

I am aware of the abundant thinking around deduplication, such as https://lwn.net/Articles/636943/ (2015):

Chinner spoke up to describe the problem, which is that there might be a hundred containers running on a system all based on a snapshot of a single root filesystem. That means there will be a hundred copies of glibc in the page cache because they come from different namespaces with different inodes, so there is no sharing of the data.

And no, I am not using KSM, so no need to mention that. I would appreciate some references to the source code shedding light on this behavior.

Upon further investigation it became clear that content from unrelated containers/pods is shared on a node, which may be a reasonable security risk.

Each line in a Dockerfile can represent 0, 1, or many layers as per https://docs.docker.com/storage/storagedriver/. For example, my current build FROM ubuntu gives three base layers in my image:

# docker inspect ubuntu

"GraphDriver": {
            "Data": {
                "LowerDir": "/var/lib/docker/overlay2/22abb0d6b77061cc1e3a04de4d3c83be15e60b87adebf9b7b2fa9adc0fbb0f2d/diff:/var/lib/docker/overlay2/7ab02c0180d53cfa2f444a10650a688c8cebd0368ddb2cea1dba7f01b2008d37/diff:/var/lib/docker/overlay2/3ee0e4ab0518c76376a4023f7c438cc6a8d28121eba9cdbed9440cfc7474204e/diff",

If I further say RUN apt-get -y install python, docker will create a layer containing all folders, files, and timestamps produced by that command. The layer will be tar'd and the sha256 will be taken of the tar file. In the ubuntu example above you can see the mezzanine layer has the sha256sum 3ee0e4ab0518c76376a4023f7c438cc6a8d28121eba9cdbed9440cfc7474204e
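For reference, the layer digests docker records for an image can be listed with the stock CLI; this is just an illustration, and how these digests map onto the overlay2 directory names is an internal detail of the storage driver:

# docker inspect --format '{{json .RootFS.Layers}}' ubuntu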

Once my image is orchestrated by a Kubernetes cluster, the layer will be inflated to a standard location on the node where the image is run. A link will be created to the layer folder - the only reason for the link is to make the path shorter, as explained here: https://docs.docker.com/storage/storagedriver/overlayfs-driver/. So a node running an image built from ubuntu will have something similar to:

# ls -l /var/lib/docker/overlay2/l |grep 3ee0e4ab0518c76
lrwxrwxrwx    1 root     root            72 Dec 13 15:40 VGN2ARTYLKI6LQWXSZSMKUQOQL -> ../3ee0e4ab0518c76376a4023f7c438cc6a8d28121eba9cdbed9440cfc7474204e/diff

Note that VGN2ARTYLKI6LQWXSZSMKUQOQL here is a) unique within the node and b) specific to that node. This identifier will appear in the mounts for containers. The root cgroup sees all the mounts on a node, and pid 1 would normally belong to the root cgroup. So the layer in question is shared like so:

# grep `ls -l /var/lib/docker/overlay2/l |grep 3ee0e4ab0518c76 |awk '{print$9}'` /proc/1/mounts

overlay /var/lib/docker/overlay2/84ec5295eb902ab01b37451f9063987f5803a0ff4bc53ee27c1838f783f61f48/merged overlay rw,relatime,lowerdir=
/var/lib/docker/overlay2/l/7RBRYLLCPECAY5IXIQWNNFMT4L:
/var/lib/docker/overlay2/l/LK4X5JGJE327XH6STN6DHMQZUI:
/var/lib/docker/overlay2/l/2RODCFKARIMWO2NUPHVP7HREVF:
/var/lib/docker/overlay2/l/DH43WT4W2DPJTMMKHJL46IPIXM:
/var/lib/docker/overlay2/l/DQBSRPR7QCKCXNT4QQHHC6L2TO:
/var/lib/docker/overlay2/l/N3NL6BAOEKFZYIAXCCFEHMRJC2:
/var/lib/docker/overlay2/l/VGN2ARTYLKI6LQWXSZSMKUQOQL,upperdir=/var/lib/docker/overlay2/84ec5295eb902ab01b37451f9063987f5803a0ff4bc53ee27c1838f783f61f48/diff,workdir=/var/lib/docker/overlay2/84ec5295eb902ab01b37451f9063987f5803a0ff4bc53ee27c1838f783f61f48/work 0 0

overlay /var/lib/docker/overlay2/89ce211716bd81100b99ecacc3c9da7af602029b2724d01db41d5efad37f43e6/merged overlay rw,relatime,lowerdir=
/var/lib/docker/overlay2/l/SQEWZDFCQQX6EKH7IZHSFXKLBN:
/var/lib/docker/overlay2/l/TJFM5IIGAQIKCMA5LDT6X4NUJK:
/var/lib/docker/overlay2/l/DQBSRPR7QCKCXNT4QQHHC6L2TO:
/var/lib/docker/overlay2/l/N3NL6BAOEKFZYIAXCCFEHMRJC2:
/var/lib/docker/overlay2/l/VGN2ARTYLKI6LQWXSZSMKUQOQL,upperdir=/var/lib/docker/overlay2/89ce211716bd81100b99ecacc3c9da7af602029b2724d01db41d5efad37f43e6/diff,workdir=/var/lib/docker/overlay2/89ce211716bd81100b99ecacc3c9da7af602029b2724d01db41d5efad37f43e6/work 0 0

Two overlay mounts mean that the layer is shared between 2 running containers built from this version of the ubuntu image. Or more concisely:

# grep `ls -l /var/lib/docker/overlay2/l |grep 3ee0e4ab0518c76 |awk '{print$9}'` /proc/1/mounts |wc -l
2
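Because both merged mounts reference the same lowerdir entry, an unmodified file read from that layer by either container resolves to one and the same underlying inode, which is exactly the condition for a single page cache entry. A hedged spot check (the file path inside the layer is a hypothetical placeholder; if util-linux's fincore is installed it also shows how much of that one file is resident in the page cache):

# stat -c '%d:%i %n' /var/lib/docker/overlay2/l/VGN2ARTYLKI6LQWXSZSMKUQOQL/<path-to-shared-file>
# fincore /var/lib/docker/overlay2/l/VGN2ARTYLKI6LQWXSZSMKUQOQL/<path-to-shared-file>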

This confirms that the content is shared between unrelated containers and explains the difference in page cache utilization I see. Is it a security risk? In theory, an adversary could plant malicious code in ubuntu wrappers and devise a nonce yielding the same sha256. Is it a risk in practice? Probably not so much...
