Kubernetes pod cannot access service which is running on another node
I'm trying to set up a k8s cluster. I've already deployed an ingress controller and a cert-manager. However, I'm now trying to deploy a first small service (Spring Cloud Config Server) and noticed that my pods cannot access services that are running on other nodes.
The pod tries to resolve a publicly available DNS name and fails due to a timeout while reaching the coredns service.
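For reference, a quick way to reproduce the failure from a specific node is to start a throwaway pod pinned to that node and run a lookup against the cluster DNS (the busybox image and the node name here are only examples):
kubectl run dns-test --rm -it --restart=Never --image=busybox:1.28 --overrides='{"spec":{"nodeSelector":{"kubernetes.io/hostname":"node-1"}}}' -- nslookup kubernetes.default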
My cluster looks like this:
Nodes:
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
k8s-master Ready master 6d17h v1.17.2 10.0.0.10 <none> CentOS Linux 7 (Core) 5.5.0-1.el7.elrepo.x86_64 docker://19.3.5
node-1 Ready <none> 6d17h v1.17.2 10.0.0.11 <none> CentOS Linux 7 (Core) 5.5.0-1.el7.elrepo.x86_64 docker://19.3.5
node-2 Ready <none> 6d17h v1.17.2 10.0.0.12 <none> CentOS Linux 7 (Core) 5.5.0-1.el7.elrepo.x86_64 docker://19.3.5
Pods:
NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
cert-manager cert-manager-c6cb4cbdf-kcdhx 1/1 Running 1 23h 10.244.2.22 node-2 <none> <none>
cert-manager cert-manager-cainjector-76f7596c4-5f2h8 1/1 Running 3 23h 10.244.1.21 node-1 <none> <none>
cert-manager cert-manager-webhook-8575f88c85-b7vcx 1/1 Running 1 23h 10.244.2.23 node-2 <none> <none>
ingress-nginx ingress-nginx-5kghx 1/1 Running 1 6d16h 10.244.1.23 node-1 <none> <none>
ingress-nginx ingress-nginx-kvh5b 1/1 Running 1 6d16h 10.244.0.6 k8s-master <none> <none>
ingress-nginx ingress-nginx-rrq4r 1/1 Running 1 6d16h 10.244.2.21 node-2 <none> <none>
project1 config-server-7897679d5d-q2hmr 0/1 CrashLoopBackOff 1 103m 10.244.1.22 node-1 <none> <none>
project1 config-server-7897679d5d-vvn6s 1/1 Running 1 21h 10.244.2.24 node-2 <none> <none>
kube-system coredns-6955765f44-7ttww 1/1 Running 2 6d17h 10.244.2.20 node-2 <none> <none>
kube-system coredns-6955765f44-b57kq 1/1 Running 2 6d17h 10.244.2.19 node-2 <none> <none>
kube-system etcd-k8s-master 1/1 Running 5 6d17h 10.0.0.10 k8s-master <none> <none>
kube-system kube-apiserver-k8s-master 1/1 Running 5 6d17h 10.0.0.10 k8s-master <none> <none>
kube-system kube-controller-manager-k8s-master 1/1 Running 8 6d17h 10.0.0.10 k8s-master <none> <none>
kube-system kube-flannel-ds-amd64-f2lw8 1/1 Running 11 6d17h 10.0.0.10 k8s-master <none> <none>
kube-system kube-flannel-ds-amd64-kt6ts 1/1 Running 11 6d17h 10.0.0.11 node-1 <none> <none>
kube-system kube-flannel-ds-amd64-pb8r9 1/1 Running 12 6d17h 10.0.0.12 node-2 <none> <none>
kube-system kube-proxy-b64jt 1/1 Running 5 6d17h 10.0.0.12 node-2 <none> <none>
kube-system kube-proxy-bltzm 1/1 Running 5 6d17h 10.0.0.10 k8s-master <none> <none>
kube-system kube-proxy-fl9xb 1/1 Running 5 6d17h 10.0.0.11 node-1 <none> <none>
kube-system kube-scheduler-k8s-master 1/1 Running 7 6d17h 10.0.0.10 k8s-master <none> <none>
Services:
NAMESPACE NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE SELECTOR
cert-manager cert-manager ClusterIP 10.102.188.88 <none> 9402/TCP 23h app.kubernetes.io/instance=cert-manager,app.kubernetes.io/name=cert-manager
cert-manager cert-manager-webhook ClusterIP 10.96.98.94 <none> 443/TCP 23h app.kubernetes.io/instance=cert-manager,app.kubernetes.io/managed-by=Helm,app.kubernetes.io/name=webhook,app=webhook
default kubernetes ClusterIP 10.96.0.1 <none> 443/TCP 6d17h <none>
ingress-nginx ingress-nginx NodePort 10.101.135.13 <none> 80:31080/TCP,443:31443/TCP 6d16h app.kubernetes.io/name=ingress-nginx,app.kubernetes.io/part-of=ingress-nginx
project1 config-server ClusterIP 10.99.94.55 <none> 80/TCP 24h app=config-server,release=config-server
kube-system kube-dns ClusterIP 10.96.0.10 <none> 53/UDP,53/TCP,9153/TCP 6d17h k8s-app=kube-dns
I've noticed that my newly deployed service has no access to the coredns service on node-1. My coredns service has two pods, neither of which is running on node-1. If I understand it correctly, it should be possible to access the coredns pods via the service IP (10.96.0.10) on every node, whether or not a coredns pod runs on it.
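One way to narrow this down is to run the lookup from a pod on node-1 both through the service VIP and directly against one of the coredns pod IPs from the listing above; if the pod IP answers but 10.96.0.10 times out, the problem is in the service (kube-proxy) layer rather than in pod-to-pod routing:
nslookup kubernetes.default 10.96.0.10    # via the kube-dns service IP
nslookup kubernetes.default 10.244.2.20   # directly against a coredns pod on node-2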
I've already noticed that the routing tables on the nodes look like this:
default via 172.31.1.1 dev eth0
10.0.0.0/16 via 10.0.0.1 dev eth1 proto static
10.0.0.1 dev eth1 scope link
10.244.0.0/24 via 10.244.0.0 dev flannel.1 onlink
10.244.1.0/24 dev cni0 proto kernel scope link src 10.244.1.1
10.244.2.0/24 via 10.244.2.0 dev flannel.1 onlink
172.17.0.0/16 dev docker0 proto kernel scope link src 172.17.0.1 linkdown
172.31.1.1 dev eth0 scope link
So as you can see, there is no route to the 10.96.0.0/16 network.
I've already checked the ports and the net.bridge.bridge-nf-call-iptables and net.bridge.bridge-nf-call-ip6tables sysctl values. All flannel ports are reachable and should be able to receive traffic over the 10.0.0.0/24 network.
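For completeness, these are the kinds of checks meant here (assuming eth1 is the node-to-node interface from the route table above; 8472/udp is flannel's default VXLAN port):
sysctl net.bridge.bridge-nf-call-iptables net.bridge.bridge-nf-call-ip6tables   # both should be 1
tcpdump -ni eth1 udp port 8472   # watch for VXLAN traffic from the other nodes while a pod retries DNS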
Here is the output of iptables -L on node-1:
Chain INPUT (policy ACCEPT)
target prot opt source destination
KUBE-SERVICES all -- anywhere anywhere ctstate NEW /* kubernetes service portals */
KUBE-EXTERNAL-SERVICES all -- anywhere anywhere ctstate NEW /* kubernetes externally-visible service portals */
KUBE-FIREWALL all -- anywhere anywhere
ACCEPT tcp -- anywhere anywhere tcp dpt:22
ACCEPT icmp -- anywhere anywhere
ACCEPT udp -- anywhere anywhere udp spt:ntp
ACCEPT tcp -- 10.0.0.0/24 anywhere
ACCEPT udp -- 10.0.0.0/24 anywhere
ACCEPT all -- anywhere anywhere state RELATED,ESTABLISHED
LOG all -- anywhere anywhere limit: avg 15/min burst 5 LOG level debug prefix "Dropped by firewall: "
DROP all -- anywhere anywhere
Chain FORWARD (policy DROP)
target prot opt source destination
KUBE-FORWARD all -- anywhere anywhere /* kubernetes forwarding rules */
KUBE-SERVICES all -- anywhere anywhere ctstate NEW /* kubernetes service portals */
DOCKER-USER all -- anywhere anywhere
DOCKER-ISOLATION-STAGE-1 all -- anywhere anywhere
ACCEPT all -- anywhere anywhere ctstate RELATED,ESTABLISHED
DOCKER all -- anywhere anywhere
ACCEPT all -- anywhere anywhere
ACCEPT all -- anywhere anywhere
ACCEPT all -- 10.244.0.0/16 anywhere
ACCEPT all -- anywhere 10.244.0.0/16
Chain OUTPUT (policy ACCEPT)
target prot opt source destination
KUBE-SERVICES all -- anywhere anywhere ctstate NEW /* kubernetes service portals */
KUBE-FIREWALL all -- anywhere anywhere
ACCEPT udp -- anywhere anywhere udp dpt:ntp
Chain DOCKER (1 references)
target prot opt source destination
Chain DOCKER-ISOLATION-STAGE-1 (1 references)
target prot opt source destination
DOCKER-ISOLATION-STAGE-2 all -- anywhere anywhere
RETURN all -- anywhere anywhere
Chain DOCKER-ISOLATION-STAGE-2 (1 references)
target prot opt source destination
DROP all -- anywhere anywhere
RETURN all -- anywhere anywhere
Chain DOCKER-USER (1 references)
target prot opt source destination
RETURN all -- anywhere anywhere
Chain KUBE-EXTERNAL-SERVICES (1 references)
target prot opt source destination
Chain KUBE-FIREWALL (2 references)
target prot opt source destination
DROP all -- anywhere anywhere /* kubernetes firewall for dropping marked packets */ mark match 0x8000/0x8000
Chain KUBE-FORWARD (1 references)
target prot opt source destination
DROP all -- anywhere anywhere ctstate INVALID
ACCEPT all -- anywhere anywhere /* kubernetes forwarding rules */ mark match 0x4000/0x4000
ACCEPT all -- 10.244.0.0/16 anywhere /* kubernetes forwarding conntrack pod source rule */ ctstate RELATED,ESTABLISHED
ACCEPT all -- anywhere 10.244.0.0/16 /* kubernetes forwarding conntrack pod destination rule */ ctstate RELATED,ESTABLISHED
Chain KUBE-KUBELET-CANARY (0 references)
target prot opt source destination
Chain KUBE-SERVICES (3 references)
target prot opt source destination
REJECT tcp -- anywhere 10.99.94.55 /* project1/config-server:http has no endpoints */ tcp dpt:http reject-with icmp-port-unreachable
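(As a side note, the last REJECT rule only means that kube-proxy saw no ready endpoints for the config-server service when the rule was generated; the current state can be checked with kubectl get endpoints config-server -n project1.)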
The cluster is deployed via Ansible.
I'm sure I'm doing something wrong. However, I couldn't see it. Can somebody help me here?
Thanks
I experienced the same issue on Kubernetes with the Calico network stack under Debian Buster.
After checking a lot of configs and parameters, I ended up getting it to work by changing the policy for the forward rule to ACCEPT. This made it clear that the issue is somewhere around the firewall. Due to security considerations I changed it back.
Running iptables -L gave me the following unveiling warning:
# Warning: iptables-legacy tables present, use iptables-legacy to see them
The output given by the list command does not contain any Calico rules. Running iptables-legacy -L showed me the Calico rules, so it seems obvious now why it didn't work: Calico apparently uses the legacy interface.
The issue is the change in Debian to iptables-nft in the alternatives, which you can check via:
ls -l /etc/alternatives | grep iptables
Doing the following:
update-alternatives --set iptables /usr/sbin/iptables-legacy
update-alternatives --set ip6tables /usr/sbin/ip6tables-legacy
update-alternatives --set arptables /usr/sbin/arptables-legacy
update-alternatives --set ebtables /usr/sbin/ebtables-legacy
Now it all works fine! Thanks to Long at the Kubernetes Slack channel for pointing out the route to solving it.
I've followed the suggestion from Dawid Kruk and tried it with kubespray. Now it works as intended. If I'm able to figure out what my mistake was, I'll post it here for future reference.
Edit: Solution
My firewall rules were too restrictive. Flannel creates new interfaces, and since my rules were not restricted to my main interface, nearly every packet from flannel was dropped. If I had looked at journalctl more attentively, I would have found the issue earlier.
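For anyone hitting the same issue, a minimal sketch of what scoping the rules to the main interface can look like, assuming eth0 is the external interface and flannel.1/cni0 are the overlay interfaces flannel creates (interface names may differ on your nodes):
iptables -I INPUT -i flannel.1 -j ACCEPT   # accept overlay traffic before any catch-all DROP
iptables -I INPUT -i cni0 -j ACCEPT
# or, alternatively, bind the catch-all drop to the external interface only:
# iptables -A INPUT -i eth0 -j DROP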
I am not sure what the exact issue is here, but I would like to clarify a few things to make them more clear.
Cluster IPs are virtual IPs. They are not routed via routing tables. Instead, for each cluster IP, kube-proxy adds NAT table entries on its respective node. To check those entries, execute the command sudo iptables -t nat -L -n -v.
Now, the coredns pods are exposed via a service cluster IP. Hence, whenever a packet arrives at a node with the cluster IP as its destination address, the destination address is changed to a pod IP address, which is routable from all the nodes (thanks to flannel). This change of destination address is done via a DNAT target entry in iptables, which looks like the one below.
Chain KUBE-SERVICES (2 references)
target prot opt source destination
KUBE-SVC-ERIFXISQEP7F7OF4 tcp -- anywhere 10.96.0.10 /* kube-system/kube-dns:dns-tcp cluster IP */ tcp dpt:domain
Chain KUBE-SVC-ERIFXISQEP7F7OF4 (1 references)
target prot opt source destination
KUBE-SEP-IT2ZTR26TO4XFPTO all -- anywhere anywhere statistic mode random probability 0.50000000000
KUBE-SEP-ZXMNUKOKXUTL2MK2 all -- anywhere anywhere
Chain KUBE-SEP-IT2ZTR26TO4XFPTO (1 references)
target prot opt source destination
KUBE-MARK-MASQ all -- 10.244.0.2 anywhere
DNAT tcp -- anywhere anywhere tcp to:10.244.0.2:53
Chain KUBE-SEP-ZXMNUKOKXUTL2MK2 (1 references)
target prot opt source destination
KUBE-MARK-MASQ all -- 10.244.0.3 anywhere
DNAT tcp -- anywhere anywhere tcp to:10.244.0.3:53
Hence, if you can re-simulate the issue, try checking the NAT table entries to see if everything is proper.
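For example, on the affected node something along these lines can confirm whether the DNAT entries for the kube-dns cluster IP exist and are actually being hit (the -v output shows packet counters):
sudo iptables -t nat -L KUBE-SERVICES -n -v | grep 10.96.0.10
sudo conntrack -L -d 10.96.0.10   # needs the conntrack tool; shows whether lookups are being translated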