Applying collect() to an Apache Spark structured streaming Dataset

I'm new to Apache Spark and currently working on a Structured Streaming pipeline. In the middle of the data processing I need to do a bit of finicky manipulation that requires all of the data (so far) to be present. At this point in the pipeline the amount of data has been heavily reduced, so performing a .collect()-like action will not be a bottleneck. The operation I need to perform is basically putting all remaining elements into a HashSet and doing a series of tricky existence checks. After this, I need to "re-enter" the streaming pipeline to perform various writes to CSV files.

However, attempting to perform collect() on a streaming pipeline understandably results in an error message. Below is a barebones (and stupid) example that illustrates my problem:

// imports ...
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
                        .appName("StructuredNetworkWordCount")
                        .getOrCreate()
val lines = spark.readStream
                 .format("socket")
                 .option("host", "localhost")
                 .option("port", 4444)
                 .load()

import spark.implicits._

// Split the lines into words
val words = lines.as[String].flatMap(_.split(" "))

// Won't work in a streaming context
val wordList = words.collectAsList()

// Perform some operations on the collected() data
val numWords = wordList.size
val doubledNum = numWords * 2

// Somehow output doubledNum

// Word counts over the stream; "complete" output mode requires an aggregation
val wordCounts = words.groupBy("value").count()

val query = wordCounts.writeStream
                      .outputMode("complete")
                      .format("console")
                      .start()

query.awaitTermination()

As I said, this will definitely not work, but it illustrates my problem. I need to perform a collect()-like action in the middle of every micro-batch in order to have simultaneous access to all the data that is left. How would I go about doing this? Are accumulators the only way to access the cumulative data across all partitions in the middle of a streaming pipeline?

Thanks!

First of all, Spark Structured Streaming returns DataFrame objects, which do not support the map and flatMap methods, so you can use the foreach method; inside it you can manipulate the incoming stream data and use a counter to count all the required elements.
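
A closely related option to the foreach approach mentioned above is DataStreamWriter.foreachBatch (available since Spark 2.4), which hands you each micro-batch as a static Dataset on which collect() is legal. Below is a minimal sketch, not a tested pipeline; the socket source, port 4444, the "doubled count" placeholder logic and the /tmp/words output path are illustrative assumptions carried over from the question.

import org.apache.spark.sql.{Dataset, SparkSession}

val spark = SparkSession.builder
                        .appName("StructuredNetworkWordCount")
                        .getOrCreate()
import spark.implicits._

val words = spark.readStream
                 .format("socket")
                 .option("host", "localhost")
                 .option("port", 4444)
                 .load()
                 .as[String]
                 .flatMap(_.split(" "))

val query = words.writeStream
                 .foreachBatch { (batch: Dataset[String], batchId: Long) =>
                   // Inside foreachBatch the micro-batch is a static Dataset,
                   // so collect() is allowed here.
                   val wordSet = batch.collect().toSet   // e.g. build the HashSet
                   val doubledNum = wordSet.size * 2     // placeholder for the "tricky" checks
                   println(s"batch $batchId -> $doubledNum")

                   // "Re-enter" with an ordinary batch write, e.g. to CSV
                   // (the output path is a placeholder):
                   batch.write.mode("append").csv(s"/tmp/words/batch_$batchId")
                 }
                 .start()

query.awaitTermination()

Because the body of foreachBatch runs on the driver, everything collected there is visible at once, which also covers the HashSet-style existence checks without resorting to accumulators.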
