Can't get custom metrics for HPA from Datadog

Hey guys, I'm trying to set up Datadog as a custom metrics source for my Kubernetes HPA using the official guide:

https://docs.datadoghq.com/agent/cluster_agent/external_metrics/?tab=helm

I'm running on EKS 1.18 with Datadog Cluster Agent (v1.10.0). The problem is that I can't get the external metrics for my HPA:

apiVersion: autoscaling/v2beta1
kind: HorizontalPodAutoscaler
metadata:
  name: hibob-hpa
spec:
  minReplicas: 1
  maxReplicas: 5
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: something
  metrics:
  - type: External
    external:
      metricName: kubernetes_state.container.cpu_limit
      metricSelector:
        matchLabels:
          pod: something-54c4bd4db7-pm9q5
      targetAverageValue: 9

horizontal-pod-autoscaler unable to get external metric:

canary/nginx.net.request_per_s/&LabelSelector{MatchLabels:map[string]string{kube_app_name: nginx,},MatchExpressions:[]LabelSelectorRequirement{},}: unable to fetch metrics from external metrics API: the server is currently unable to handle the request (get nginx.net.request_per_s.external.metrics.k8s.io)

These are the errors I'm getting inside the cluster-agent:

datadog-cluster-agent-585897dc8d-x8l82 cluster-agent 2021-08-20 06:46:14 UTC | CLUSTER | ERROR | (pkg/clusteragent/externalmetrics/metrics_retriever.go:77 in retrieveMetricsValues) | Unable to fetch external metrics: [Error while executing metric query avg:nginx.net.request_per_s{kubea_app_name:ingress-nginx}.rollup(30): API error 403 Forbidden: {"status":********@datadoghq.com"}, strconv.Atoi: parsing "": invalid syntax]
# datadog-cluster-agent status
Getting the status from the agent.
2021-08-19 15:28:21 UTC | CLUSTER | WARN | (pkg/util/log/log.go:541 in func1) | Agent configuration relax permissions constraint on the secret backend cmd, Group can read and exec
===============================
Datadog Cluster Agent (v1.10.0)
===============================

  Status date: 2021-08-19 15:28:21.519850 UTC
  Agent start: 2021-08-19 12:11:44.266244 UTC
  Pid: 1
  Go Version: go1.14.12
  Build arch: amd64
  Agent flavor: cluster_agent
  Check Runners: 4
  Log Level: INFO

  Paths
  =====
    Config File: /etc/datadog-agent/datadog-cluster.yaml
    conf.d: /etc/datadog-agent/conf.d

  Clocks
  ======
    System UTC time: 2021-08-19 15:28:21.519850 UTC

  Hostnames
  =========
    ec2-hostname: ip-10-30-162-8.eu-west-1.compute.internal
    hostname: i-00d0458844a597dec
    instance-id: i-00d0458844a597dec
    socket-fqdn: datadog-cluster-agent-585897dc8d-x8l82
    socket-hostname: datadog-cluster-agent-585897dc8d-x8l82
    hostname provider: aws
    unused hostname providers:
      configuration/environment: hostname is empty
      gce: unable to retrieve hostname from GCE: status code 404 trying to GET http://169.254.169.254/computeMetadata/v1/instance/hostname

  Metadata
  ========

Leader Election
===============
  Leader Election Status:  Running
  Leader Name is: datadog-cluster-agent-585897dc8d-x8l82
  Last Acquisition of the lease: Thu, 19 Aug 2021 12:13:14 UTC
  Renewed leadership: Thu, 19 Aug 2021 15:28:07 UTC
  Number of leader transitions: 17 transitions

Custom Metrics Server
=====================
  External metrics provider uses DatadogMetric - Check status directly from Kubernetes with: `kubectl get datadogmetric`


Admission Controller
====================
  Disabled: The admission controller is not enabled on the Cluster Agent


=========
Collector
=========

  Running Checks
  ==============

    kubernetes_apiserver
    --------------------
      Instance ID: kubernetes_apiserver [OK]
      Configuration Source: file:/etc/datadog-agent/conf.d/kubernetes_apiserver.d/conf.yaml.default
      Total Runs: 787
      Metric Samples: Last Run: 0, Total: 0
      Events: Last Run: 0, Total: 660
      Service Checks: Last Run: 3, Total: 2,343
      Average Execution Time : 1.898s
      Last Execution Date : 2021-08-19 15:28:17.000000 UTC
      Last Successful Execution Date : 2021-08-19 15:28:17.000000 UTC

=========
Forwarder
=========

  Transactions
  ============
    Deployments: 350
    Dropped: 0
    DroppedOnInput: 0
    Nodes: 497
    Pods: 3
    ReplicaSets: 576
    Requeued: 0
    Retried: 0
    RetryQueueSize: 0
    Services: 263

  Transaction Successes
  =====================
    Total number: 3442
    Successes By Endpoint:
      check_run_v1: 786
      intake: 181
      orchestrator: 1,689
      series_v1: 786

==========
Endpoints
==========
  https://app.datadoghq.eu - API Key ending with:
      - f295b

=====================
Orchestrator Explorer
=====================
  ClusterID: f7b4f97a-3cf2-11ea-aaa8-0a158f39909c
  ClusterName: production
  ContainerScrubbing: Enabled
  ======================
  Orchestrator Endpoints
  ======================

  ===============
  Forwarder Stats
  ===============
    Pods: 3
    Deployments: 350
    ReplicaSets: 576
    Services: 263
    Nodes: 497

  ===========
  Cache Stats
  ===========
    Elements in the cache: 393
    Pods:
      Last Run: (Hits: 0 Miss: 0) | Total: (Hits: 7 Miss: 5)
    Deployments:
      Last Run: (Hits: 36 Miss: 1) | Total: (Hits: 40846 Miss: 2444)
    ReplicaSets:
      Last Run: (Hits: 297 Miss: 1) | Total: (Hits: 328997 Miss: 19441)
    Services:
      Last Run: (Hits: 44 Miss: 0) | Total: (Hits: 49520 Miss: 2919)
    Nodes:
      Last Run: (Hits: 9 Miss: 0) | Total: (Hits: 10171 Miss: 755)


And this is what I get from describing the DatadogMetric:

Name:         dcaautogen-2f116f4425658dca91a33dd22a3d943bae5b74
Namespace:    datadog
Labels:       <none>
Annotations:  <none>
API Version:  datadoghq.com/v1alpha1
Kind:         DatadogMetric
Metadata:
  Creation Timestamp:  2021-08-19T15:14:14Z
  Generation:          1
  Managed Fields:
    API Version:  datadoghq.com/v1alpha1
    Fields Type:  FieldsV1
    fieldsV1:
      f:spec:
      f:status:
        .:
        f:autoscalerReferences:
        f:conditions:
          .:
          k:{"type":"Active"}:
            .:
            f:lastTransitionTime:
            f:lastUpdateTime:
            f:status:
            f:type:
          k:{"type":"Error"}:
            .:
            f:lastTransitionTime:
            f:lastUpdateTime:
            f:message:
            f:reason:
            f:status:
            f:type:
          k:{"type":"Updated"}:
            .:
            f:lastTransitionTime:
            f:lastUpdateTime:
            f:status:
            f:type:
          k:{"type":"Valid"}:
            .:
            f:lastTransitionTime:
            f:lastUpdateTime:
            f:status:
            f:type:
        f:currentValue:
    Manager:         datadog-cluster-agent
    Operation:       Update
    Time:            2021-08-19T15:14:44Z
  Resource Version:  164942235
  Self Link:         /apis/datadoghq.com/v1alpha1/namespaces/datadog/datadogmetrics/dcaautogen-2f116f4425658dca91a33dd22a3d943bae5b74
  UID:               6e9919eb-19ca-4131-b079-4a8a9ac577bb
Spec:
  External Metric Name:  nginx.net.request_per_s
  Query:                 avg:nginx.net.request_per_s{kube_app_name:nginx}.rollup(30)
Status:
  Autoscaler References:  canary/hibob-hpa
  Conditions:
    Last Transition Time:  2021-08-19T15:14:14Z
    Last Update Time:      2021-08-19T15:53:14Z
    Status:                True
    Type:                  Active
    Last Transition Time:  2021-08-19T15:14:14Z
    Last Update Time:      2021-08-19T15:53:14Z
    Status:                False
    Type:                  Valid
    Last Transition Time:  2021-08-19T15:14:14Z
    Last Update Time:      2021-08-19T15:53:14Z
    Status:                True
    Type:                  Updated
    Last Transition Time:  2021-08-19T15:14:44Z
    Last Update Time:      2021-08-19T15:53:14Z
    Message:               Global error (all queries) from backend
    Reason:                Unable to fetch data from Datadog
    Status:                True
    Type:                  Error
  Current Value:           0
Events:                    <none>
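
For reference, the Query field above looks like it is assembled from the HPA's external metric name and label selector, prefixed with the avg aggregator and suffixed with a 30-second rollup. A minimal Python sketch of that mapping, assuming this simple format holds; build_query is a hypothetical helper written for illustration, not Cluster Agent code:

```python
def build_query(metric_name, match_labels, aggregator="avg", rollup_seconds=30):
    """Assemble a Datadog query string from an HPA external metric spec.

    Illustrative only: it mirrors the shape of the Query field shown in the
    DatadogMetric output and is not the Cluster Agent's implementation.
    """
    # Sort the labels so the resulting string is deterministic.
    tags = ",".join(f"{key}:{value}" for key, value in sorted(match_labels.items()))
    # Assumption: with no labels, fall back to the "*" (all sources) scope.
    scope = tags if tags else "*"
    return f"{aggregator}:{metric_name}{{{scope}}}.rollup({rollup_seconds})"


query = build_query("nginx.net.request_per_s", {"kube_app_name": "nginx"})
print(query)
```

Comparing a string built this way with the query quoted in the cluster-agent error log can help spot a mistyped tag name or value.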

This is my Cluster Agent deployment:

apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: "18"
    meta.helm.sh/release-name: datadog
    meta.helm.sh/release-namespace: datadog
  creationTimestamp: "2021-02-05T07:36:39Z"
  generation: 18
  labels:
    app.kubernetes.io/instance: datadog
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: datadog
    app.kubernetes.io/version: "7"
    helm.sh/chart: datadog-2.7.0
  name: datadog-cluster-agent
  namespace: datadog
  resourceVersion: "164881216"
  selfLink: /apis/apps/v1/namespaces/datadog/deployments/datadog-cluster-agent
  uid: ec52bb4b-62af-4007-9bab-d5d16c48e02c
spec:
  progressDeadlineSeconds: 600
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: datadog-cluster-agent
  strategy:
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
    type: RollingUpdate
  template:
    metadata:
      annotations:
        ad.datadoghq.com/cluster-agent.check_names: '["prometheus"]'
        ad.datadoghq.com/cluster-agent.init_configs: '[{}]'
        ad.datadoghq.com/cluster-agent.instances: |
          [{
            "prometheus_url": "http://%%host%%:5000/metrics",
            "namespace": "datadog.cluster_agent",
            "metrics": [
              "go_goroutines", "go_memstats_*", "process_*",
              "api_requests",
              "datadog_requests", "external_metrics", "rate_limit_queries_*",
              "cluster_checks_*"
            ]
          }]
        checksum/api_key: something
        checksum/application_key: something
        checksum/clusteragent_token: something
        checksum/install_info: something
      creationTimestamp: null
      labels:
        app: datadog-cluster-agent
      name: datadog-cluster-agent
    spec:
      containers:
      - env:
        - name: DD_HEALTH_PORT
          value: "5555"
        - name: DD_API_KEY
          valueFrom:
            secretKeyRef:
              key: api-key
              name: datadog
              optional: true
        - name: DD_APP_KEY
          valueFrom:
            secretKeyRef:
              key: app-key
              name: datadog-appkey
        - name: DD_EXTERNAL_METRICS_PROVIDER_ENABLED
          value: "true"
        - name: DD_EXTERNAL_METRICS_PROVIDER_PORT
          value: "8443"
        - name: DD_EXTERNAL_METRICS_PROVIDER_WPA_CONTROLLER
          value: "false"
        - name: DD_EXTERNAL_METRICS_PROVIDER_USE_DATADOGMETRIC_CRD
          value: "true"
        - name: DD_EXTERNAL_METRICS_AGGREGATOR
          value: avg
        - name: DD_CLUSTER_NAME
          value: production
        - name: DD_SITE
          value: datadoghq.eu
        - name: DD_LOG_LEVEL
          value: INFO
        - name: DD_LEADER_ELECTION
          value: "true"
        - name: DD_COLLECT_KUBERNETES_EVENTS
          value: "true"
        - name: DD_CLUSTER_AGENT_KUBERNETES_SERVICE_NAME
          value: datadog-cluster-agent
        - name: DD_CLUSTER_AGENT_AUTH_TOKEN
          valueFrom:
            secretKeyRef:
              key: token
              name: datadog-cluster-agent
        - name: DD_KUBE_RESOURCES_NAMESPACE
          value: datadog
        - name: DD_ORCHESTRATOR_EXPLORER_ENABLED
          value: "true"
        - name: DD_ORCHESTRATOR_EXPLORER_CONTAINER_SCRUBBING_ENABLED
          value: "true"
        - name: DD_COMPLIANCE_CONFIG_ENABLED
          value: "false"
        image: gcr.io/datadoghq/cluster-agent:1.10.0
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 6
          httpGet:
            path: /live
            port: 5555
            scheme: HTTP
          initialDelaySeconds: 15
          periodSeconds: 15
          successThreshold: 1
          timeoutSeconds: 5
        name: cluster-agent
        ports:
        - containerPort: 5005
          name: agentport
          protocol: TCP
        - containerPort: 8443
          name: metricsapi
          protocol: TCP
        readinessProbe:
          failureThreshold: 6
          httpGet:
            path: /ready
            port: 5555
            scheme: HTTP
          initialDelaySeconds: 15
          periodSeconds: 15
          successThreshold: 1
          timeoutSeconds: 5
        resources: {}
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /etc/datadog-agent/install_info
          name: installinfo
          readOnly: true
          subPath: install_info
      dnsConfig:
        options:
        - name: ndots
          value: "3"
      dnsPolicy: ClusterFirst
      nodeSelector:
        kubernetes.io/os: linux
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      serviceAccount: datadog-cluster-agent
      serviceAccountName: datadog-cluster-agent
      terminationGracePeriodSeconds: 30
      volumes:
      - configMap:
          defaultMode: 420
          name: datadog-installinfo
        name: installinfo
status:
  availableReplicas: 1
  conditions:
  - lastTransitionTime: "2021-05-13T15:46:33Z"
    lastUpdateTime: "2021-05-13T15:46:33Z"
    message: Deployment has minimum availability.
    reason: MinimumReplicasAvailable
    status: "True"
    type: Available
  - lastTransitionTime: "2021-02-05T07:36:39Z"
    lastUpdateTime: "2021-08-19T12:12:06Z"
    message: ReplicaSet "datadog-cluster-agent-585897dc8d" has successfully progressed.
    reason: NewReplicaSetAvailable
    status: "True"
    type: Progressing
  observedGeneration: 18
  readyReplicas: 1
  replicas: 1
  updatedReplicas: 1

For the record, I got this sorted.

According to the Helm chart's default values file, you must set the app key in order to use the metrics provider:

  # datadog.appKey -- Datadog APP key required to use metricsProvider

  ## If you are using clusterAgent.metricsProvider.enabled = true, you must set
  ## a Datadog application key for read access to your metrics.
  appKey:  # <DATADOG_APP_KEY>

I guess this is a gap in the docs, and also a check that is missing at cluster-agent startup. I'm going to open an issue about it.
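
The kind of pre-flight check described as missing could be sketched roughly as follows. This is only an illustration in Python, not anything the chart or the Cluster Agent ships: check_metrics_provider_values is a hypothetical helper, it takes the Helm values as an already-parsed dictionary, and it only looks at datadog.appKey (a key supplied some other way, for example through a pre-existing secret, is not covered).

```python
def check_metrics_provider_values(values):
    """Return a list of problems found in a parsed Helm values mapping.

    Hypothetical helper: flags the case where the metrics provider is
    enabled but no Datadog application key is set.
    """
    problems = []
    datadog = values.get("datadog") or {}
    cluster_agent = values.get("clusterAgent") or {}
    provider = cluster_agent.get("metricsProvider") or {}

    if provider.get("enabled") and not datadog.get("appKey"):
        # The chart comment says the app key is needed for read access to metrics.
        problems.append(
            "clusterAgent.metricsProvider.enabled is true but datadog.appKey is not set"
        )
    return problems


# Example values: provider enabled, app key left empty as in the default file.
example_values = {
    "datadog": {"appKey": None},
    "clusterAgent": {"metricsProvider": {"enabled": True}},
}
for problem in check_metrics_provider_values(example_values):
    print("WARNING:", problem)
```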

The official documentation on troubleshooting the agent says:

If you see the following error when describing the HPA manifest:

 Warning FailedComputeMetricsReplicas 3s (x2 over 33s) horizontal-pod-autoscaler failed to get nginx.net.request_per_s external metric: unable to get external metric default/nginx.net.request_per_s/&LabelSelector{MatchLabels:map[string]string{kube_container_name: nginx,},MatchExpressions:[],}: unable to fetch metrics from external metrics API: the server is currently unable to handle the request (get nginx.net.request_per_s.external.metrics.k8s.io)

Make sure the Datadog Cluster Agent is running, and the service exposing the port 8443, whose name is registered in the APIService, is up.

I believe the key phrase here is "whose name is registered in the APIService". Did you perform the APIService registration for your external metrics service? This source should provide some details on how to set it up. Since you're getting 403 Forbidden errors, the TLS setup may be what is causing issues.
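
One way to check the registration is to dump the APIService with `kubectl get apiservice v1beta1.external.metrics.k8s.io -o json` and look at which service it points to and whether its Available condition is True. A rough Python sketch for reading that JSON, offered as an illustration only: apiservice_summary is a hypothetical helper, and the sample document below (service name, reason, message) is invented for the example rather than taken from the cluster in the question.

```python
import json


def apiservice_summary(apiservice):
    """Summarize the backing service and availability of an APIService object."""
    service = (apiservice.get("spec") or {}).get("service") or {}
    conditions = (apiservice.get("status") or {}).get("conditions") or []
    # The registration is healthy when the "Available" condition is "True".
    available = next((c for c in conditions if c.get("type") == "Available"), {})
    return {
        "service": f'{service.get("namespace")}/{service.get("name")}',
        "port": service.get("port"),
        "available": available.get("status") == "True",
        "reason": available.get("reason"),
        "message": available.get("message"),
    }


# Invented sample data, shaped like the output of the kubectl command above.
sample = json.loads("""
{
  "kind": "APIService",
  "metadata": {"name": "v1beta1.external.metrics.k8s.io"},
  "spec": {
    "service": {"name": "datadog-cluster-agent-metrics-api",
                "namespace": "datadog", "port": 8443}
  },
  "status": {
    "conditions": [
      {"type": "Available", "status": "False",
       "reason": "FailedDiscoveryCheck", "message": "sample message"}
    ]
  }
}
""")
summary = apiservice_summary(sample)
print(summary)
```

If "available" comes back False, the service named in the summary is the one to check for a running pod behind port 8443.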

Perhaps you can follow the guide in general and ensure that your node agent is functioning correctly and has the token environment variable correctly configured.
