Monitoring Structured Streaming

I have a structured stream set up that is running just fine, but I was hoping to monitor it while it is running. 我有一个运行良好的结构化流设置,但我希望在它运行时监视它。

I have built an EventCollector:

import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener._

class EventCollector extends StreamingQueryListener {
  override def onQueryStarted(event: QueryStartedEvent): Unit = {
    println("Start")
  }

  override def onQueryProgress(event: QueryProgressEvent): Unit = {
    println(event.queryStatus.prettyJson)
  }

  override def onQueryTerminated(event: QueryTerminatedEvent): Unit = {
    println("Term")
  }
}

I have built an EventCollector and added the listener to my spark session:

val listener = new EventCollector()
spark.streams.addListener(listener)

Then I fire off the query:

val query = inputDF.writeStream
  //.format("console")
  .queryName("Stream")
  .foreach(writer)
  .start()

query.awaitTermination()

However, onQueryProgress never gets hit. onQueryStarted does, but I was hoping to get the progress of the query at a certain interval to monitor how the queries are doing. Can anyone assist with this?

After much research into this topic, this is what I have found...

OnQueryProgress gets hit in between queries. I am not sure if this is intentional functionality or not, but while we are streaming data from a file, OnQueryProgress does not fire off.

A solution I have found was to rely on the foreach writer sink and perform my own analysis of performance within the process function. Unfortunately, we cannot access specific information about the query that is running. Or, I have not figured out how to yet. This is what I have implemented in my sandbox to analyze performance:

import org.apache.spark.sql.ForeachWriter

val writer = new ForeachWriter[rawDataRow] {
    // Rough sandbox counters; these live on the writer instance, so counts
    // are per instance rather than global across the cluster
    var counter = 0L
    var startTime = System.nanoTime()

    def open(partitionId: Long, version: Long): Boolean = {
        //We end up here in between files
        true
    }

    def process(value: rawDataRow): Unit = {
        counter += 1

        if(counter % 1000 == 0) {
            val currentTime = System.nanoTime()
            val elapsedTime = (currentTime - startTime)/1000000000.0

            println(s"Records Written:  $counter")
            println(s"Time Elapsed: $elapsedTime seconds")
        }
    }

    def close(errorOrNull: Throwable): Unit = {}
}

An alternative way to get metrics:

Another way to get information about the running queries is to hit the GET endpoints that Spark provides us:

http://localhost:4040/metrics

or

http://localhost:4040/api/v1/

Documentation here: http://spark.apache.org/docs/latest/monitoring.html
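For a quick end-to-end check, the sketch below fetches the /api/v1/applications endpoint using nothing but the Scala standard library and prints the raw JSON, which includes the application ID needed for the more specific endpoints. The host and port are assumed to be the UI defaults; this is an illustration, not a hardened client.

import scala.io.Source

// Minimal sketch: fetch the list of running applications from the Spark UI's
// REST API and print the raw JSON. Assumes the default UI host/port.
object ListApplications {
  def main(args: Array[String]): Unit = {
    val source = Source.fromURL("http://localhost:4040/api/v1/applications")
    try {
      println(source.mkString) // raw JSON; includes each application's id
    } finally {
      source.close()
    }
  }
}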

Update Number 2, Sept 2017: Tested on regular spark streaming, not structured streaming

Disclaimer: this may not apply to structured streaming; I need to set up a test bed to confirm. However, it does work with regular spark streaming (consuming from Kafka in this example).

I believe that since spark streaming 2.2 has been released, new endpoints exist that can retrieve more metrics on the performance of the stream. This may have existed in previous versions and I just missed it, but I wanted to make sure it was documented for anyone else searching for this information.

http://localhost:4040/api/v1/applications/{applicationIdHere}/streaming/statistics

This is the endpoint that looks like it was added in 2.2 (or it already existed and was just added to the documentation; I'm not sure, I haven't checked).

Anyways, it adds metrics in this format for the streaming application specified:

{
  "startTime" : "2017-09-13T14:02:28.883GMT",
  "batchDuration" : 1000,
  "numReceivers" : 0,
  "numActiveReceivers" : 0,
  "numInactiveReceivers" : 0,
  "numTotalCompletedBatches" : 90379,
  "numRetainedCompletedBatches" : 1000,
  "numActiveBatches" : 0,
  "numProcessedRecords" : 39652167,
  "numReceivedRecords" : 39652167,
  "avgInputRate" : 771.722,
  "avgSchedulingDelay" : 2,
  "avgProcessingTime" : 85,
  "avgTotalDelay" : 87
}

This gives us the ability to build our own custom metric/monitoring applications using the REST endpoints that are exposed by Spark.
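As a rough sketch of what such a monitor could look like, the loop below polls the streaming/statistics endpoint every ten seconds and pulls out avgInputRate with a crude regex. The application ID, host, port, and polling interval are all placeholder assumptions, and a real implementation would parse the JSON with a proper library rather than a regex.

import scala.io.Source

// Rough polling sketch against the streaming/statistics endpoint.
// "{applicationIdHere}" must be replaced with the id returned by
// http://localhost:4040/api/v1/applications before this will run.
object StreamingStatsPoller {
  val statsUrl =
    "http://localhost:4040/api/v1/applications/{applicationIdHere}/streaming/statistics"

  def main(args: Array[String]): Unit = {
    val inputRatePattern = "\"avgInputRate\"\\s*:\\s*([0-9.]+)".r
    while (true) {
      val source = Source.fromURL(statsUrl)
      val json = try source.mkString finally source.close()

      // Crude extraction for illustration; a real monitor would use a JSON
      // library such as json4s or Jackson instead.
      val avgInputRate = inputRatePattern.findFirstMatchIn(json)
        .map(_.group(1)).getOrElse("n/a")

      println(s"avgInputRate: $avgInputRate records/sec")
      Thread.sleep(10000) // poll every 10 seconds
    }
  }
}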
