
Applying collect() to an Apache Spark Structured Streaming Dataset

I'm new to Apache Spark and currently working on a Structured Streaming pipeline. In the middle of the data processing I need to do a bit of finicky manipulation that requires all of the data (so far) to be present. The amount of data has been heavily reduced by this point in the pipeline, so performing a .collect()-like action will not be a bottleneck. The operation I need to perform is basically putting all remaining elements into a HashSet and doing a series of tricky existence checks. After this, I need to "re-enter" the streaming pipeline to perform various writes to CSV files.

However, attempting to perform collect() on a streaming Dataset understandably results in an error. Below is a bare-bones (and admittedly contrived) example that illustrates my problem:

// imports
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
                        .appName("StructuredNetworkWordCount")
                        .getOrCreate()
val lines = spark.readStream
                 .format("socket")
                 .option("host", "localhost")
                 .option("port", 4444)
                 .load()

import spark.implicits._

// Split the lines into words
val words = lines.as[String].flatMap(_.split(" "))

// Won't work in a streaming context
val wordList = words.collectAsList()

// Perform some operations on the collected() data
val numWords = wordList.size
val doubledNum = numWords * 2

// Somehow output doubledNum
val wordCounts = words.groupBy("value").count()

val query = wordCounts.writeStream
                      .outputMode("complete")
                      .format("console")
                      .start()

query.awaitTermination()

As I said, this will definitely not work, but it illustrates my problem. I need to perform a collect()-like action in the middle of every micro-batch in order to have simultaneous access to all of the data that is left. How would I go about doing this? Are accumulators the only way to access all of the cumulative data across partitions in the middle of a streaming pipeline?

Thanks!

First of all, Spark Structured Streaming returns a DataFrame object, which does not support the map and flatMap methods, so you can use the foreach method; inside it you can manipulate the incoming stream data and use a counter to count all of the required elements.
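As a minimal sketch of that per-micro-batch idea: instead of row-by-row foreach, the closely related foreachBatch sink (available since Spark 2.4) hands each micro-batch to you as an ordinary, non-streaming Dataset, where collect() is allowed. The setup below reuses the question's example names, and the HashSet logic is only a placeholder for the real existence checks.

import org.apache.spark.sql.{Dataset, SparkSession}

val spark = SparkSession.builder
                        .appName("StructuredNetworkWordCount")
                        .getOrCreate()

import spark.implicits._

val lines = spark.readStream
                 .format("socket")
                 .option("host", "localhost")
                 .option("port", 4444)
                 .load()

val words = lines.as[String].flatMap(_.split(" "))

// Each micro-batch arrives here as a plain (non-streaming) Dataset,
// so collect() is legal inside this function. The set-based logic is
// a stand-in for the question's existence checks.
val handleBatch: (Dataset[String], Long) => Unit = (batch, batchId) => {
  val wordSet = batch.collect().toSet
  val doubledNum = wordSet.size * 2
  println(s"batch $batchId: doubledNum = $doubledNum")
  // batch.write.csv(...) could go here to "re-enter" the pipeline with writes
}

val query = words.writeStream
                 .outputMode("append")
                 .foreachBatch(handleBatch)
                 .start()

query.awaitTermination()

Note that the function is bound to an explicitly typed val before being passed to foreachBatch; with some Spark/Scala version combinations, passing a bare lambda can hit an overload ambiguity between the Scala and Java foreachBatch signatures.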
