Debugging nfs volume "Unable to attach or mount volumes for pod"
I've set up an NFS server that serves a ReadWriteMany PV according to the example at https://github.com/kubernetes/examples/tree/master/staging/volumes/nfs
This setup works fine for me in lots of production environments, but in one specific GKE cluster instance, mounting stopped working after the pods restarted.
From the kubelet logs I see the following repeating many times:
Unable to attach or mount volumes for pod "api-bf5869665-zpj4c_default(521b43c8-319f-425f-aaa7-e05c08282e8e)": unmounted volumes=[shared-mount], unattached volumes=[geekadm.net deployment-role-token-6tg9p shared-mount]: timed out waiting for the condition; skipping pod
Error syncing pod 521b43c8-319f-425f-aaa7-e05c08282e8e ("api-bf5869665-zpj4c_default(521b43c8-319f-425f-aaa7-e05c08282e8e)"), skipping: unmounted volumes=[shared-mount], unattached volumes=[geekadm.net deployment-role-token-6tg9p shared-mount]: timed out waiting for the condition
Manually mounting the NFS share on any of the nodes works just fine:
mount -t nfs <service ip>:/ /tmp/mnt
How can I further debug the issue? Are there any other logs I could look at besides the kubelet's?
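A few other places worth checking besides the kubelet logs (a sketch; the pod name is taken from the log lines above, and the namespace is assumed to be default — adjust both for your cluster):

```shell
# Pod events often contain the concrete mount error, not just the timeout
kubectl describe pod api-bf5869665-zpj4c -n default

# Cluster-wide events around the failure window, newest last
kubectl get events -n default --sort-by=.lastTimestamp

# On the affected GKE node (via SSH): kubelet and kernel-level NFS messages
journalctl -u kubelet --since "1 hour ago" | grep -i mount
dmesg | grep -i nfs

# Confirm the NFS service is reachable from the node and exports RPC services
rpcinfo -p <service ip>
showmount -e <service ip>
```

These commands require access to the cluster and node, so they are illustrative rather than runnable standalone; `describe pod` and `kubectl get events` are usually the fastest way to see the underlying mount error that the "timed out waiting for the condition" message hides.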
In case the pod gets kicked out of the node because the mount is too slow, you may see messages like this in the logs. The kubelet even reports this issue explicitly.
Sample kubelet log:
Setting volume ownership for /var/lib/kubelet/pods/c9987636-acbe-4653-8b8d-aa80fe423597/volumes/kubernetes.io~gce-pd/pvc-fbae0402-b8c7-4bc8-b375-1060487d730d and fsGroup set. If the volume has a lot of files then setting volume ownership could be slow, see https://github.com/kubernetes/kubernetes/issues/69699
Cause:
The pod.spec.securityContext.fsGroup setting causes the kubelet to run chown and chmod on all the files in the volumes mounted for the given pod. This can be very time-consuming for large volumes with many files.
By default, Kubernetes recursively changes ownership and permissions for the contents of each volume to match the fsGroup specified in a Pod's securityContext when that volume is mounted. (From the documentation.)
Solution:
You can deal with it in the following ways.
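One option from the Kubernetes documentation is to set fsGroupChangePolicy: "OnRootMismatch" in the pod's securityContext, which skips the recursive chown/chmod when the root of the volume already has the expected ownership and permissions. A minimal sketch, assuming a pod and PVC named like the ones in the question (the image and claim name are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: api
spec:
  securityContext:
    fsGroup: 2000
    # Only walk the volume tree when the root's owner/permissions
    # don't already match fsGroup; avoids the slow full recursion.
    fsGroupChangePolicy: "OnRootMismatch"
  containers:
    - name: api
      image: nginx              # illustrative image
      volumeMounts:
        - name: shared-mount
          mountPath: /data
  volumes:
    - name: shared-mount
      persistentVolumeClaim:
        claimName: nfs-claim    # assumed PVC name
```

Note this only helps for volume types where the kubelet performs fsGroup-based ownership management (such as the gce-pd volume in the sample log above); it does not change mount behavior itself.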
Did you specify an NFS version when mounting from the command line? I had the same issue on AKS, but inspired by https://stackoverflow.com/a/71789693/1382108 I checked the NFS versions. I noticed my PV had vers=3. When I tried mounting from the command line with
mount -t nfs -o vers=3
the command just hung; with vers=4.1 it worked immediately. I changed the version in my PV and the next Pod worked just fine.
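The NFS version can be pinned in the PersistentVolume itself via mountOptions. A sketch, assuming the in-cluster NFS service from the question (the PV name, capacity, and export path are placeholders):

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: nfs-pv
spec:
  capacity:
    storage: 1Gi                # placeholder size
  accessModes:
    - ReadWriteMany
  mountOptions:
    - nfsvers=4.1               # force NFSv4.1 instead of the hanging vers=3
  nfs:
    server: <service ip>        # the NFS service IP from the question
    path: "/"
```

Mount options are not validated at PV creation time; an invalid option only surfaces as a mount failure, so it is worth testing the same options manually on a node first, as above.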