
Spark streaming custom metrics

I'm working on a Spark Streaming program which retrieves a Kafka stream, does a very basic transformation on the stream, and then inserts the data into a DB (VoltDB, if it's relevant). I'm trying to measure the rate at which I insert rows into the DB. I think metrics could be useful here (via JMX). However, I can't find out how to add custom metrics to Spark. I've looked at Spark's source code and also found this thread, but it doesn't work for me. I also enabled the JMX sink in the conf/metrics.properties file. What isn't working is that I don't see my custom metrics in JConsole.

Could someone explain how to add custom metrics (preferably via JMX) to Spark Streaming? Or, alternatively, how to measure the insertion rate into my DB (specifically VoltDB)? I'm using Spark with Java 8.

OK, after digging through the source code I found out how to add my own custom metrics. It requires three things:

  1. Create my own custom source, sort of like this.
  2. Enable the JMX sink in Spark's metrics.properties file. The specific line I used is *.sink.jmx.class=org.apache.spark.metrics.sink.JmxSink, which enables JmxSink for all instances.
  3. Register my custom source in the SparkEnv metrics system. An example of how to do this can be seen here - I had actually viewed this link before, but I missed the registration part, which prevented me from actually seeing my custom metrics in JVisualVM. A combined sketch of steps 1 and 3 is shown right after this list.
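
Roughly, steps 1 and 3 together look something like the sketch below. The class name InsertRateSource and the meter name dbInsertRate are just illustrations, not anything Spark defines; the class sits in org.apache.spark.metrics.source because Spark's Source trait is package-private:

package org.apache.spark.metrics.source;

import com.codahale.metrics.Meter;
import com.codahale.metrics.MetricRegistry;

/**
 * Minimal custom source exposing one meter for DB insert throughput.
 */
public class InsertRateSource implements Source {
    private final MetricRegistry registry = new MetricRegistry();
    // a Codahale Meter tracks both a total count and 1/5/15-minute rates
    private final Meter insertMeter = registry.meter("dbInsertRate");

    @Override
    public String sourceName() {
        return "voltdbInserts";
    }

    @Override
    public MetricRegistry metricRegistry() {
        return registry;
    }

    // call this from wherever you perform the inserts
    public void markInserted(long rows) {
        insertMeter.mark(rows);
    }
}

Step 3 is then a one-liner right after the SparkContext is created: SparkEnv.get().metricsSystem().registerSource(new InsertRateSource());. With the JmxSink from step 2 enabled, the meter should show up under the source name in JConsole.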

I'm still struggling with how to actually count the number of insertions into VoltDB, because the code runs on the executors, but that's a subject for a different question :)

I hope this will help others.

Groupon has a library called spark-metrics that lets you use a simple (Codahale-like) API on your executors and have the results collated back in the driver and automatically registered in Spark's existing metrics registry. These then get exported automatically along with Spark's built-in metrics when you configure a metric sink as per the Spark docs.

To count rows based on inserts into VoltDB, use accumulators - and then from your driver you can register a listener - maybe something like this to get you started:

import org.apache.spark.scheduler.{SparkListener, SparkListenerStageCompleted}

sparkContext.addSparkListener(new SparkListener() {
  override def onStageCompleted(stageCompleted: SparkListenerStageCompleted): Unit =
    stageCompleted.stageInfo.accumulables.foreach { case (_, acc) =>
      println(s"${acc.name}: ${acc.value}") // forward to your metrics sink here
    }
})

Here you have access to the combined accumulators for those rows, and you can then send them on to your sink.
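
For the executor-side counting itself, a minimal sketch might look like this in Java (assuming Spark 2.x's LongAccumulator; jsc is a JavaSparkContext, rdd is a JavaRDD of rows to insert, and insertRow is a hypothetical stand-in for the actual VoltDB insert call):

import org.apache.spark.util.LongAccumulator;

// on the driver: a named accumulator also shows up in stageInfo.accumulables
LongAccumulator insertedRows = jsc.sc().longAccumulator("voltdbInsertedRows");

rdd.foreachPartition(rows -> {
    while (rows.hasNext()) {
        insertRow(rows.next());   // hypothetical VoltDB insert
        insertedRows.add(1L);     // incremented on the executor, merged back on the driver
    }
});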

Here's an excellent tutorial which covers all the steps you need to set up Spark's MetricsSystem with Graphite. That should do the trick:

http://www.hammerlab.org/2015/02/27/monitoring-spark-with-graphite-and-grafana/

Below is a working example in Java.
It's tested with StreamingQuery (unfortunately, as of Spark 2.3.1, StreamingQuery does not have out-of-the-box metrics the way StreamingContext does).

Steps:

Define a custom source in the same package as Spark's Source class (Source is package-private, which is why the package below matters):

package org.apache.spark.metrics.source;

import com.codahale.metrics.Gauge;
import com.codahale.metrics.MetricRegistry;
import lombok.Data;
import lombok.experimental.Accessors;
import org.apache.spark.sql.streaming.StreamingQueryProgress;

/**
 * Metrics source for structured streaming query.
 */
public class StreamingQuerySource implements Source {
    private String appName;
    private MetricRegistry metricRegistry = new MetricRegistry();
    private final Progress progress = new Progress();

    public StreamingQuerySource(String appName) {
        this.appName = appName;
        // expose the latest query progress values as Codahale gauges
        registerGauge("batchId", () -> progress.batchId());
        registerGauge("numInputRows", () -> progress.numInputRows());
        registerGauge("inputRowsPerSecond", () -> progress.inputRowsPerSecond());
        registerGauge("processedRowsPerSecond", () -> progress.processedRowsPerSecond());
    }

    private <T> Gauge<T> registerGauge(String name, Gauge<T> metric) {
        return metricRegistry.register(MetricRegistry.name(name), metric);
    }

    @Override
    public String sourceName() {
        return String.format("%s.streaming", appName);
    }


    @Override
    public MetricRegistry metricRegistry() {
        return metricRegistry;
    }

    public void updateProgress(StreamingQueryProgress queryProgress) {
        progress.batchId(queryProgress.batchId())
                .numInputRows(queryProgress.numInputRows())
                .inputRowsPerSecond(queryProgress.inputRowsPerSecond())
                .processedRowsPerSecond(queryProgress.processedRowsPerSecond());
    }

    @Data
    @Accessors(fluent = true)
    private static class Progress {
        private long batchId = -1;
        private long numInputRows = 0;
        private double inputRowsPerSecond = 0;
        private double processedRowsPerSecond = 0;
    }
}

Register the source right after the SparkContext is created:

    querySource = new StreamingQuerySource(getSparkSession().sparkContext().appName());
    SparkEnv.get().metricsSystem().registerSource(querySource);

Update the data in StreamingQueryListener.onQueryProgress(event):

  querySource.updateProgress(event.progress());
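
Wired together, the listener registration might look something like this (assuming a SparkSession named spark and the querySource created in the registration step above):

import org.apache.spark.sql.streaming.StreamingQueryListener;

spark.streams().addListener(new StreamingQueryListener() {
    @Override
    public void onQueryStarted(QueryStartedEvent event) {
        // nothing to do on start
    }

    @Override
    public void onQueryProgress(QueryProgressEvent event) {
        // push the latest progress numbers into the gauges
        querySource.updateProgress(event.progress());
    }

    @Override
    public void onQueryTerminated(QueryTerminatedEvent event) {
        // nothing to do on termination
    }
});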

Configure metrics.properties:

*.sink.graphite.class=org.apache.spark.metrics.sink.GraphiteSink
*.sink.graphite.host=xxx
*.sink.graphite.port=9109
*.sink.graphite.period=10
*.sink.graphite.unit=seconds

# Enable jvm source for instance master, worker, driver and executor
master.source.jvm.class=org.apache.spark.metrics.source.JvmSource
worker.source.jvm.class=org.apache.spark.metrics.source.JvmSource
driver.source.jvm.class=org.apache.spark.metrics.source.JvmSource
executor.source.jvm.class=org.apache.spark.metrics.source.JvmSource

Sample output from the Graphite exporter (mapped to Prometheus format):

streaming_query{application="local-1538032184639",model="model1",qty="batchId"} 38
streaming_query{application="local-1538032184639",model="model1",qty="inputRowsPerSecond"} 2.5
streaming_query{application="local-1538032184639",model="model1",qty="numInputRows"} 5
streaming_query{application="local-1538032184639",model="model1",qty="processedRowsPerSecond"} 0.81
