
Prometheus Exporter - Direct Instrumentation vs Custom Collector

I'm currently writing a Prometheus exporter for a telemetry network application.

I've read the Writing Exporters documentation, and while I understand the use case for implementing a custom collector to avoid race conditions, I'm not sure whether my use case could fit with direct instrumentation.

Basically, the network metrics are streamed via gRPC by the network devices, so my exporter just receives them and doesn't actually have to scrape the devices itself.

I've used direct instrumentation with below code:

  • I declare my metrics using the promauto package to keep the code compact:
package metrics

import (
    "github.com/lucabrasi83/prom-high-obs/proto/telemetry"
    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
)

var (
    cpu5Sec = promauto.NewGaugeVec(
        prometheus.GaugeOpts{
            Name: "cisco_iosxe_iosd_cpu_busy_5_sec_percentage",
            Help: "The IOSd daemon CPU busy percentage over the last 5 seconds",
        },
        []string{"node"},
    )
)
  • Below is how I set the metric value from the decoded gRPC protocol buffer message:
cpu5Sec.WithLabelValues(msg.GetNodeIdStr()).Set(float64(val))
  • Finally, here is my main loop, which handles the telemetry gRPC streams for the metrics I'm interested in:
for {
    req, err := stream.Recv()
    if err == io.EOF {
        return nil
    }
    if err != nil {
        logging.PeppaMonLog(
            "error",
            fmt.Sprintf("Error while reading client %v stream: %v", clientIPSocket, err))

        return err
    }

    data := req.GetData()

    msg := &telemetry.Telemetry{}

    err = proto.Unmarshal(data, msg)
    if err != nil {
        log.Fatalln(err)
    }

    if !logFlag {
        logging.PeppaMonLog(
            "info",
            fmt.Sprintf(
                "Telemetry Subscription Request Received - Client %v - Node %v - YANG Model Path %v",
                clientIPSocket, msg.GetNodeIdStr(), msg.GetEncodingPath(),
            ),
        )
    }
    logFlag = true

    // Flag to determine whether the telemetry device streams an accepted YANG node path
    yangPathSupported := false

    for _, m := range metrics.CiscoMetricRegistrar {
        if msg.EncodingPath == m.EncodingPath {
            yangPathSupported = true
            go m.RecordMetricFunc(msg)
        }
    }
}
  • For each metric I'm interested in, I register it with a record metric function (m.RecordMetricFunc) that takes the protocol buffer message as an argument, as per below (a sketch of such a record function appears after the registration step).
package metrics

import "github.com/lucabrasi83/prom-high-obs/proto/telemetry"

var CiscoMetricRegistrar []CiscoTelemetryMetric

type CiscoTelemetryMetric struct {
    EncodingPath     string
    RecordMetricFunc func(msg *telemetry.Telemetry)
}

  • I then use an init function for the actual registration:


func init() {
    CiscoMetricRegistrar = append(CiscoMetricRegistrar, CiscoTelemetryMetric{
        EncodingPath:     CpuYANGEncodingPath,
        RecordMetricFunc: ParsePBMsgCpuBusyPercent,
    })
}
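
The record function itself boils down to extracting the CPU value from the decoded message and setting the gauge. Roughly, with extractCpu5SecBusy standing in for the model-specific payload traversal (not shown here):

// Rough sketch of a record function. extractCpu5SecBusy is a hypothetical
// helper that walks the decoded telemetry payload and returns the 5-second
// CPU busy percentage plus whether it was found.
func ParsePBMsgCpuBusyPercent(msg *telemetry.Telemetry) {
    val, ok := extractCpu5SecBusy(msg)
    if !ok {
        return
    }
    cpu5Sec.WithLabelValues(msg.GetNodeIdStr()).Set(val)
}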

I'm using Grafana as the frontend and so far haven't seen any particular discrepancy when correlating the Prometheus-exposed metrics against the metrics checked directly on the device.

So I would like to understand whether this follows Prometheus best practices, or whether I should still go down the custom collector route.

Thanks in advance.

You are not following best practices because you are using the global metrics that the article you linked to cautions against. With your current implementation your dashboard will forever show some arbitrary and constant value for the CPU metric after a device disconnects (or, more precisely, until your exporter is restarted).

Instead, the RPC method should maintain a set of local metrics and remove them once the method returns. That way the device's metrics vanish from the scrape output when it disconnects.
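
(As a point of comparison: if you stayed with the question's promauto GaugeVec, the stale series could also be dropped explicitly with DeleteLabelValues when a stream ends. A minimal sketch, with a hypothetical helper name; the collector approach below scales better because an entire per-stream metric set is removed in one place rather than one label combination per metric vector.)

// Sketch only: drop the series for a disconnected device so it no longer
// appears in the scrape output. The stream handler would defer this call
// with the node label it has been exporting.
func CleanupDeviceSeries(nodeID string) {
    // DeleteLabelValues removes the child gauge for this exact label
    // combination from the cpu5Sec GaugeVec declared in the question.
    cpu5Sec.DeleteLabelValues(nodeID)
}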

Here is one way to do this. It uses a map that contains the currently active metrics. Each map element is the set of metrics for one particular stream (which, as I understand it, corresponds to one device). Once the stream ends, that entry is removed.

package main

import (
    "sync"

    "github.com/prometheus/client_golang/prometheus"
)

// Exporter is a prometheus.Collector implementation.
type Exporter struct {
    // We need some way to map gRPC streams to their metrics. Using the stream
    // itself as a map key is simple enough, but anything works as long as we
    // can remove metrics once the stream ends.
    sync.Mutex
    Metrics map[StreamServer]*DeviceMetrics
}

type DeviceMetrics struct {
    sync.Mutex

    CPU prometheus.Metric
}

// Globally defined descriptions are fine.
var cpu5SecDesc = prometheus.NewDesc(
    "cisco_iosxe_iosd_cpu_busy_5_sec_percentage",
    "The IOSd daemon CPU busy percentage over the last 5 seconds",
    []string{"node"},
    nil, // constant labels
)

// Collect implements prometheus.Collector.
func (e *Exporter) Collect(ch chan<- prometheus.Metric) {
    // Copy current metrics so we don't lock for very long if ch's consumer is
    // slow.
    var metrics []prometheus.Metric

    e.Lock()
    for _, deviceMetrics := range e.Metrics {
        deviceMetrics.Lock()
        metrics = append(metrics,
            deviceMetrics.CPU,
        )
        deviceMetrics.Unlock()
    }
    e.Unlock()

    for _, m := range metrics {
        if m != nil {
            ch <- m
        }
    }
}

// Describe implements prometheus.Collector.
func (e *Exporter) Describe(ch chan<- *prometheus.Desc) {
    ch <- cpu5SecDesc
}

// Service is the gRPC service implementation.
type Service struct {
    exp *Exporter
}

func (s *Service) RPCMethod(stream StreamServer) (*Response, error) {
    deviceMetrics := new(DeviceMetrics)

    s.exp.Lock()
    s.exp.Metrics[stream] = deviceMetrics
    s.exp.Unlock()

    defer func() {
        // Stop emitting metrics for this stream.
        s.exp.Lock()
        delete(s.exp.Metrics, stream)
        s.exp.Unlock()
    }()

    for {
        req, err := stream.Recv()
        if err != nil {
            // TODO: distinguish io.EOF (clean disconnect) from other errors.
            return &Response{}, err
        }

        var msg *Telemetry = parseRequest(req) // Your existing code that unmarshals the nested message.

        var (
            metricField *prometheus.Metric
            metric      prometheus.Metric
        )

        switch msg.GetEncodingPath() {
        case CpuYANGEncodingPath:
            metricField = &deviceMetrics.CPU
            metric = prometheus.MustNewConstMetric(
                cpu5SecDesc,
                prometheus.GaugeValue,
                ParsePBMsgCpuBusyPercent(msg), // func(*Telemetry) float64
                msg.GetNodeIdStr(),            // value for the "node" label
            )
        default:
            continue
        }

        deviceMetrics.Lock()
        *metricField = metric
        deviceMetrics.Unlock()
    }

    return &Response{}, nil
}
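
To actually expose these metrics, the Exporter still has to be registered and served over HTTP. Here is a minimal sketch of that wiring, extending the file above (the listen address is arbitrary, and the gRPC server setup that hosts Service is omitted):

// Additional imports assumed: "log", "net/http",
// "github.com/prometheus/client_golang/prometheus/promhttp".

func main() {
    exp := &Exporter{Metrics: make(map[StreamServer]*DeviceMetrics)}

    // Register the custom collector. Registering on the default registry via
    // prometheus.MustRegister(exp) works just as well.
    reg := prometheus.NewRegistry()
    reg.MustRegister(exp)

    // Start the gRPC server hosting &Service{exp: exp} here (omitted), so
    // RPCMethod can add and remove per-stream metrics.

    // Expose the scrape endpoint.
    http.Handle("/metrics", promhttp.HandlerFor(reg, promhttp.HandlerOpts{}))
    log.Fatal(http.ListenAndServe(":2112", nil))
}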
