Prometheus on GKE Autopilot?

Question

Currently, in the kubernetes-nodes job of my Prometheus setup, the endpoint /api/v1/nodes/gk3-<cluster name>-default-pool-<something arbitrary>/proxy/metrics is being scraped.

The problem is that I get a 403 error saying GKEAutopilot authz: cluster scoped resource "nodes/proxy" is managed and access is denied when I try it in Postman.

How do I get around this on GKE Autopilot?

Answer 1

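While the Autopilot docs don't mention the node proxy API specifically, this is in the limitations section:

"Most external monitoring tools require access that is restricted. Solutions from several Google Cloud partners are available for use on Autopilot, however not all are supported, and custom monitoring tools cannot be installed on Autopilot clusters."

Given that port-forwarding and all other node-level access is restricted, it seems likely that this is not available. It's not clear whether Autopilot uses the kubelet at all, and they probably aren't going to tell you.

End-of-year update:

This mostly works now. Autopilot has added support for things like cluster-scoped objects and webhooks. You do need to reconfigure any install manifests so they don't touch the kube-system namespace, because it is still locked down, but you can get most of this working if you keep hammering on it.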

Answer 2

I created a firewall rule to allow ingress traffic to ports 10250-10255 (kubelet):

$ gcloud compute firewall-rules create test-kubelet-ingress --allow tcp:10250-10255 --source-ranges="0.0.0.0/0"

Then I ran the following to make sure the user can create nodes/proxy:

$ kubectl config view
$ kubectl get all --all-namespaces
$ kubectl create clusterrolebinding autopilot-cluster-1 --clusterrole=k8-cluster-1 --user=infosys-khajashaik@premium-cloud-support.com

Checking:

$ kubectl auth can-i create nodes/proxy
Warning: resource 'nodes' is not namespace scoped
yes

Then I called the kubelet directly:

$ curl -k https://{NODE_PUBLIC_IP}:10250/run/kube-system/{POD_NAME}/netd -d "cmd=ls" --header "Authorization: Bearer $TOKEN"

where:

TOKEN = <auto generated token in local kubeconfig>
NODE_PUBLIC_IP = <the public ip of the node>
POD_NAME = <netd pod name in the node>

So even though the user has the permission in the kube-apiserver, the kubelet still denies creating "nodes/proxy".
If nodes/proxy is removed from the authz, creating the proxy succeeds:

$ curl -k https://35.202.254.215:10250/run/kube-system/netd-ff5vr/netd -d "cmd=ls" --header "Authorization: Bearer $TOKEN"

Answer 3

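It looks like GKE Autopilot denies access to "nodes/proxy".

The kubelet metrics do seem to be available, though. For example, you can access them from within the cluster with:

curl [Node_Internal_IP]:10255/metrics

I ended up scraping the kubelets directly, rather than going through the proxy, with the following scrape config: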

- job_name: kubernetes-nodes
  kubernetes_sd_configs:
    - role: node
  tls_config:
    ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
  bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
  scheme: https
  metrics_path: /metrics/cadvisor

You need this RBAC rule on the ClusterRole:

- apiGroups: [""]
  resources: ["nodes/metrics"]
  verbs: ["get"]

With the above, it was possible to scrape container resource metrics from the kubelet in a GKE Autopilot cluster.
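To sanity-check the direct-kubelet approach before wiring it into Prometheus, something like the following Python sketch could be run from a pod whose service account has the nodes/metrics rule. It mirrors the scrape config: HTTPS, the service account CA file, and the bearer token. The function names are hypothetical, and port 10250 is an assumption (the kubelet's usual secure port); adjust it if your nodes differ.

```python
import ssl
import urllib.request

# Files mounted into pods by Kubernetes; the same paths the scrape config uses.
SA_DIR = "/var/run/secrets/kubernetes.io/serviceaccount"
CA_FILE = f"{SA_DIR}/ca.crt"
TOKEN_FILE = f"{SA_DIR}/token"


def build_kubelet_request(node_ip, token, path="/metrics/cadvisor", port=10250):
    """Build an authenticated GET request for a kubelet metrics endpoint.

    port=10250 is an assumption (the kubelet's usual secure port).
    """
    url = f"https://{node_ip}:{port}{path}"
    return urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})


def fetch_kubelet_metrics(node_ip, path="/metrics/cadvisor"):
    """Fetch the metrics text from one kubelet. Only works from inside a pod."""
    with open(TOKEN_FILE) as f:
        token = f.read().strip()
    # Verify the kubelet's certificate against the service account CA,
    # as the tls_config in the scrape config does.
    context = ssl.create_default_context(cafile=CA_FILE)
    request = build_kubelet_request(node_ip, token, path)
    with urllib.request.urlopen(request, context=context, timeout=10) as response:
        return response.read().decode("utf-8")
```

Calling fetch_kubelet_metrics with a node's internal IP should return the same text Prometheus would scrape; a 403 here would point at the RBAC rule rather than at the Prometheus configuration.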

Answer 4

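As of about June to July 2022, it looks like this is now possible. There are a lot of nuances, though. Most of them exist because, as you'll see if you read https://cloud.google.com/kubernetes-engine/docs/concepts/autopilot-overview#unsupported_cluster_features, GKE Autopilot clusters have a ton of limitations in exchange for security, reliability, and a hands-off UX (user experience). GKE 1.22's default firewall rules also changed, so unless your webhooks use port 443 you'll need to update the firewall rules.

So I'll start with some background context:

There are at least 3 flavors of Prometheus you can deploy to GKE Autopilot:

1. The upstream Prometheus Operator, which deploys a self-hosted Prometheus on the cluster.
2. GCP's "managed collectors", which ship metrics to GCP's Prometheus as a managed service.
3. GCP's "unmanaged collectors", which ship metrics to GCP's Prometheus as a managed service.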

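What is the upstream Prometheus Operator?
Prometheus is a server with 2 roles: (1) it is a time series database, and (2) it scrapes/pulls metrics into itself to store them in its self-hosted time series database.

What is GCP's Prometheus as a managed service?
It is a Prometheus API compatibility layer on top of Monarch (GCP's internal time series database for metrics, the same time series database that GCP's Metrics Explorer uses).

What are GCP's collectors?
They are a replacement for the Prometheus server, but they work differently from the upstream Prometheus server. A collector is a thin wrapper that only collects Prometheus metrics and then pushes them to GCP's managed Prometheus service. (I suspect the only difference between managed and unmanaged is the Kubernetes operator; if you read their docs, they recommend using managed collectors.)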

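Here is how to get GCP's "managed collectors" (which they recommend) deployed to a GKE Autopilot cluster, and how to verify that they work with GCP's managed Prometheus service:

Provision a GKE Autopilot cluster, then follow the instructions on this page for setting up managed collection: https://cloud.google.com/stackdriver/docs/managed-prometheus/setup-managed#config-mgd-collection

It gives you 4 options for setup: Console, gcloud CLI, Terraform, and kubectl CLI. The first 3 options (Console, gcloud CLI, and Terraform) don't work; attempting them returns an error saying it's not supported. I suspect they all use the same API, and based on the following 2 pages I suspect the "unsupported" error message is outdated:
https://issuehint.com/issue/GoogleCloudPlatform/prometheus-engine/148
https://issuehint.com/issue/GoogleCloudPlatform/prometheus-engine/186
Ignore the warning and use the kubectl method, and it will work.

The page also has 3 commands you can run to generate some test data: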

kubectl create ns gmp-test

kubectl -n gmp-test apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/prometheus-engine/v0.4.3-gke.0/examples/example-app.yaml

kubectl -n gmp-test apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/prometheus-engine/v0.4.3-gke.0/examples/pod-monitoring.yaml

You can see the test data by doing the following:

kubectl get pod -o wide -n=gmp-test

Look up the IP address of a pod that exposes metrics (in my case 10.8.0.132), then run:

kubectl run curlpod -it --image=curlimages/curl -- sh
# got a message about timed out waiting for the condition which I ignored
kubectl exec -it curlpod -- curl 10.8.0.132:1234/metrics
# (I figured out :1234/metrics, by looking at the yaml manifest of the example pod)

curl prints a big wall of text representing Prometheus metrics:

example_random_numbers_bucket{le="0"} 0
...
go_memstats_stack_inuse_bytes 360448
...
process_virtual_memory_max_bytes -1
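Rather than eyeballing the wall of text, the curl output can be checked programmatically. The following is a minimal Python sketch, not a full parser for the Prometheus text exposition format: it skips comment lines, ignores optional timestamps, and assumes label values do not contain anything after the final closing brace other than the sample value. The function name parse_metrics is hypothetical.

```python
def parse_metrics(text):
    """Parse Prometheus text exposition output into {series: value}.

    The series key is the metric name plus its label set exactly as printed,
    for example 'example_random_numbers_bucket{le="0"}'.
    """
    samples = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        if "}" in line:
            # Labels may contain spaces, so split after the closing brace.
            end = line.rindex("}") + 1
            series, rest = line[:end], line[end:]
        else:
            series, rest = line.split(None, 1)
        fields = rest.split()
        if not fields:
            continue
        # The first field is the value; a timestamp may follow and is ignored.
        samples[series] = float(fields[0])
    return samples
```

For example, piping the curl output into this function and looking up go_memstats_stack_inuse_bytes in the resulting dictionary confirms that the example app is exposing the metric you later query for.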

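At this point everything works, but GCP's GUI console for the managed Prometheus service doesn't give immediate, obvious feedback that things are working. Since I knew an example Prometheus metric from the previous step, I plugged go_memstats_stack_inuse_bytes into the GUI, ran the query, and could verify that it was working as expected.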
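The same check can in principle be scripted instead of done in the GUI. The sketch below builds a query against the Prometheus-compatible HTTP API of the managed service and inspects a standard Prometheus query response. The endpoint URL pattern is an assumption based on my reading of the Managed Service for Prometheus documentation, so verify it there before relying on it; the function names, the project ID, and the way you obtain the access token (for example from gcloud auth print-access-token) are also assumptions.

```python
import json
import urllib.parse
import urllib.request


def build_query_request(project_id, promql, access_token):
    """Build an instant-query request for the managed Prometheus API.

    The URL pattern below is an assumption; check the current GCP docs.
    """
    base = (
        "https://monitoring.googleapis.com/v1/projects/"
        f"{project_id}/location/global/prometheus/api/v1/query"
    )
    url = base + "?" + urllib.parse.urlencode({"query": promql})
    return urllib.request.Request(
        url, headers={"Authorization": f"Bearer {access_token}"}
    )


def has_samples(response_body):
    """Return True if a Prometheus query response contains at least one series."""
    payload = json.loads(response_body)
    result = payload.get("data", {}).get("result", [])
    return payload.get("status") == "success" and len(result) > 0
```

Sending the request with urllib.request.urlopen and passing the response body to has_samples would tell you whether go_memstats_stack_inuse_bytes has reached the managed service.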

Answer 5

See https://github.com/SumoLogic/sumologic-kubernetes-collection/issues/1468#issuecomment-1005800179 for a workaround for installing Prometheus on GKE Autopilot.

Answer 1: score 3, accepted, 2021-05-19 08:37:47
Answer 2: score 1, 2021-05-19 09:25:57
Answer 3: score 1, 2022-05-18 16:02:27
Answer 4: score 0, 2022-07-21 12:26:10
Answer 5: score -2, 2022-01-05 15:30:09