
How can I use aggregate with join in the same query with Spark?

I need to do a join to enrich my DataFrame with Postgres data. In Spark Streaming I can do this normally because the data is processed in batches. However, in Structured Streaming I get an error whenever I try to use an aggregation together with a join.

For example: if I use the aggregation with output mode complete, the job works normally; however, if I add the join, it returns the error:

Join between two streaming DataFrames/Datasets is not supported in Complete output mode, only in Append output mode;

The same happens if I do the opposite. When I use the join with output mode append, the job runs normally; however, if I add the aggregation, the job returns the error:

Append output mode not supported when there are streaming aggregations on streaming DataFrames/DataSets without watermark;

Finally, I would like to know if there is any way to use join and aggregate together in Spark Structured Streaming without getting an error.

If so, what would an implementation that doesn't generate this kind of error look like?

def main(args: Array[String]): Unit = {
  val ss: SparkSession = Spark.getSparkSession
  val postgresSQL = new PostgresConnection
  val dataCollector = new DataCollector(postgresSQL)
  val collector = new Collector(ss, dataCollector)

  import ss.implicits._

  val stream: DataFrame = Kafka.setStructuredStream(ss)

  val parsed: DataFrame = Stream.parseInputMessages(stream)

  val getRelation: DataFrame = collector.getLastRelation(parsed)

  getRelation
    .writeStream.format("console")
    .trigger(Trigger.ProcessingTime(5000))
    .outputMode("complete")
    .queryName("Join")
    .start()

  ss.streams.awaitAnyTermination()
}

In my getLastRelation method I call a convertData method and a compareData method.

def getLastRelation(messageToProcess: DataFrame): DataFrame = {
  // Transforms the DF to prepare the lookup
  val ds: Dataset[Message] = this.convertData(messageToProcess)
  val dsRelacaolista: Dataset[WithStructure] = this.getPersonStructure(ds)
  val compareData = this.compareData(dsRelacaolista, messageToProcess)
  compareData
}

In my convertData method I use an agg.

def convertData(data: DataFrame): Dataset[Message] = {
  data.selectExpr("country", "code", "order")
    .groupBy($"country", $"order")
    .agg(collect_list("code")
      .as("code"))
    .as[Message]
}

And in my compareData method I use join:

def compareData(data: Dataset[WithStructure], message: DataFrame): DataFrame = {
  val tableJoin = message.selectExpr("order", "order_id", "hashCompare", "created_at")

  data.toDF()
    .withColumn("hashCompare", hash($"country", $"code"))
    .join(tableJoin, "hashCompare")
}

NOTE: I use Scala as the language (I don't know if this information is important for this question).

If you're running an aggregation query on a stream, you need to specify a watermark and a window.

For example:

data
  .withWatermark("created_at", "10 minutes")
  .selectExpr("country", "code", "order", "created_at")
  .groupBy(window($"created_at", "10 minutes", "5 minutes"), $"country", $"order")
  .agg(collect_list("code")
    .as("code"))
  .as[Message]

Data arriving in your streams can be delayed for any reason (network slowdowns, etc.). Watermarks let you specify how long an aggregation should wait for lagging events. Any event arriving with a delay greater than the watermark threshold will be ignored.

Append mode doesn't allow modifying previously emitted results. It therefore requires a watermark, which guarantees that the aggregated data is not going to be updated any further.

You can choose a longer watermark, which gives you higher tolerance for delayed data. The drawback is that the output will be delayed by the watermark duration, because the query has to wait for that time to pass before an aggregation window can be finalized.
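Applied to your query, a minimal sketch could look like this (assuming created_at in the parsed stream from your main is a timestamp column; this is an illustration, not a drop-in fix):

import org.apache.spark.sql.functions.{collect_list, window}
import org.apache.spark.sql.streaming.Trigger

// Watermarked aggregation on the parsed stream
// (uses import ss.implicits._ from your main for the $ syntax).
val aggregated = parsed
  .withWatermark("created_at", "10 minutes")
  .groupBy(window($"created_at", "10 minutes"), $"country", $"order")
  .agg(collect_list("code").as("code"))

aggregated
  .writeStream.format("console")
  .trigger(Trigger.ProcessingTime(5000))
  .outputMode("append") // Append is allowed because the aggregation is watermarked
  .queryName("Join")
  .start()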

Additionally, for stream-stream joins (when both sides of the join are streaming Datasets) you also need to specify watermarks on both inputs and an event-time constraint (a time-range join condition or a window join), so Spark knows how long to keep join state; see the sketch below the quote.

From docs:

The challenge of generating join results between two data streams is that, at any point of time, the view of the dataset is incomplete for both sides of the join making it much harder to find matches between inputs.
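For illustration, a stream-stream join with watermarks on both sides and a time-range condition looks roughly like this (the stream and column names here are made up, not taken from the question):

import org.apache.spark.sql.functions.expr

// Hypothetical inputs: two streaming DataFrames with event-time columns.
val orders = orderStream.withWatermark("orderTime", "10 minutes")
val payments = paymentStream.withWatermark("paymentTime", "20 minutes")

// The time-range condition bounds how long Spark must buffer each side.
val joined = orders.join(
  payments,
  expr("""
    order_id = payment_order_id AND
    paymentTime >= orderTime AND
    paymentTime <= orderTime + interval 1 hour
  """)
)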

Please check the docs on doing stream-stream joins and on doing streaming aggregations.

It seems you cannot :(

From Structured Streaming Programming Guide (emphasis mine):

Additional details on supported joins:

  • Joins can be cascaded, that is, you can do df1.join(df2, ...).join(df3, ...).join(df4, ....).

  • As of Spark 2.4, you can use joins only when the query is in Append output mode. Other output modes are not yet supported.

  • As of Spark 2.4, you cannot use other non-map-like operations before joins. Here are a few examples of what cannot be used.

    • Cannot use streaming aggregations before joins.
    • Cannot use mapGroupsWithState and flatMapGroupsWithState in Update mode before joins.

Suggestion: an alternative might be to do the aggregation in a separate stream and save it to a sink. Then read from that sink in a new stream to perform the join you wanted.
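A rough sketch of that two-query pipeline, assuming Kafka as the intermediate sink (the topic name, servers, and checkpoint path below are made up):

import org.apache.spark.sql.functions.{collect_list, window, to_json, struct}

// Query 1: a watermarked aggregation is written to an intermediate topic
// (assumes `parsed` has a timestamp column `created_at`, as in the question).
parsed
  .withWatermark("created_at", "10 minutes")
  .groupBy(window($"created_at", "10 minutes"), $"country", $"order")
  .agg(collect_list("code").as("code"))
  .select(to_json(struct($"country", $"order", $"code")).as("value"))
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("topic", "aggregated-messages") // hypothetical topic
  .option("checkpointLocation", "/tmp/checkpoints/aggregation")
  .outputMode("append")
  .start()

// Query 2: read the pre-aggregated records back. This stream contains no
// aggregation of its own, so it can be joined in Append mode without the error.
val aggStream = ss.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "aggregated-messages")
  .load()
  .selectExpr("CAST(value AS STRING) AS json") // parse back with from_json(...) as needed

The trade-off is an extra hop of latency and an extra topic to manage, but each query individually satisfies the restrictions quoted above.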
