
How can I use aggregate with join in the same query result with Spark?

I need to do a join to enrich my dataframe with Postgres data. In Spark Streaming I can do this normally because the data is processed in batches. However, in Structured Streaming I get an error whenever I try to use an aggregate together with a join.

For example: if I use the aggregate with output mode complete, the job works normally; however, if I add the join, it returns the error:

Join between two streaming DataFrames/Datasets is not supported in Complete output mode, only in Append output mode;

The same happens if I do the opposite. When I use the join with output mode append, the job runs normally; however, if I add the aggregate, the job returns the error:

Append output mode not supported when there are streaming aggregations on streaming DataFrames/DataSets without watermark;

Finally, I would like to know whether there is any way to use join and aggregate together in Spark Structured Streaming without getting an error.

If so, what would an implementation look like that does not generate that kind of error?

def main(args: Array[String]): Unit = {
  val ss: SparkSession = Spark.getSparkSession
  val postgresSQL = new PostgresConnection
  val dataCollector = new DataCollector(postgresSQL)
  val collector = new Collector(ss,dataCollector)
  
  import ss.implicits._
  
  val stream: DataFrame = Kafka.setStructuredStream(ss)

  val parsed: DataFrame = Stream.parseInputMessages(stream)

  val getRelation: DataFrame = collector.getLastRelation(parsed)

  getRelation
    .writeStream
    .format("console")
    .trigger(Trigger.ProcessingTime(5000))
    .outputMode("complete")
    .queryName("Join")
    .start()

  ss.streams.awaitAnyTermination()
}

In my getLastRelation method I call a convertData method and a compareData method.

def getLastRelation(messageToProcess: DataFrame): DataFrame = {
  // Transform the DF to prepare the lookup
  val dss: Dataset[Message] = this.convertData(messageToProcess)
  val dsRelacaolista: Dataset[WithStructure] = this.getPersonStructure(dss)
  val compareData = this.compareData(dsRelacaolista, messageToProcess)
  compareData
}

In my convertData method I use an agg:

def convertData(data: DataFrame): Dataset[Message] = {
  data.selectExpr("country", "code", "order")
    .groupBy($"country", $"order")
    .agg(collect_list("code").as("code"))
    .as[Message]
}

And in my compareData method I use a join:

def compareData(data: Dataset[WithStructure], message: DataFrame): DataFrame = {
  val tableJoin = message.selectExpr("order", "order_id", "hashCompare", "created_at")

  data.toDF()
    .withColumn("hashCompare", hash($"country", $"code"))
    .join(tableJoin, "hashCompare")
}

NOTE: I use Scala as the language (I don't know whether this information is important for this question).

If you're doing an aggregating query on a stream, you need to specify a watermark and a window.

For example:

data
  .withWatermark("created_at", "10 minutes")
  .selectExpr("created_at", "country", "code", "order") // keep the event-time column for the window
  .groupBy(window($"created_at", "10 minutes", "5 minutes"), $"country", $"order")
  .agg(collect_list("code").as("code"))
  .as[Message]

Data arriving in your streams can be delayed for any reason (network slowdowns, etc.). A watermark specifies how long the aggregation should wait for lagging events; any event that arrives with a delay greater than the time specified in the watermark is ignored.

Append mode doesn't allow modifying previously emitted results. It therefore requires a watermark to guarantee that aggregated data will not be updated any further.
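As a minimal sketch of how the watermarked aggregation could then be written in Append mode (here `aggregated` stands for the result of the windowed query above; it is an assumption for illustration, not code from the question):

// Sketch only: `aggregated` is the watermarked, windowed Dataset from
// the example above. With a watermark defined, Append mode is allowed,
// because Spark knows when each window is finalized.
aggregated
  .writeStream
  .format("console")
  .outputMode("append")
  .trigger(Trigger.ProcessingTime(5000))
  .queryName("AggregateWithWatermark")
  .start()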

You can choose a longer watermark duration, which gives you higher tolerance for delayed data. The drawback is that the output is delayed by the watermark duration, because the query has to wait for that interval to pass before an aggregation window can be finalized.

Additionally, for stream-stream joins (when both sides of the join are streaming datasets), you also need to specify watermarks and, typically, an event-time constraint on the join condition.

From the docs:

The challenge of generating join results between two data streams is that, at any point of time, the view of the dataset is incomplete for both sides of the join, making it much harder to find matches between inputs.

Please check the docs on doing stream-stream joins and on doing streaming aggregations.
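For illustration, here is the kind of stream-stream join the guide describes, adapted from the docs' advertising example (the impressions/clicks names and intervals come from that example, not from the question): watermarks on both inputs plus an event-time range condition so Spark knows how long to buffer join state.

import org.apache.spark.sql.functions.expr

// Both sides are streaming DataFrames with event-time columns.
val impressionsWithWatermark = impressions.withWatermark("impressionTime", "2 hours")
val clicksWithWatermark = clicks.withWatermark("clickTime", "3 hours")

// The time range condition bounds the state Spark must keep:
// a click can only match an impression within 1 hour of it.
impressionsWithWatermark.join(
  clicksWithWatermark,
  expr("""
    clickAdId = impressionAdId AND
    clickTime >= impressionTime AND
    clickTime <= impressionTime + interval 1 hour
  """)
)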

It seems you cannot :(

From the Structured Streaming Programming Guide (emphasis mine):

Additional details on supported joins:

  • Joins can be cascaded, that is, you can do df1.join(df2, ...).join(df3, ...).join(df4, ....).

  • As of Spark 2.4, you can use joins only when the query is in Append output mode. Other output modes are not yet supported.

  • As of Spark 2.4, you cannot use other non-map-like operations before joins. Here are a few examples of what cannot be used.

    • Cannot use streaming aggregations before joins.
    • Cannot use mapGroupsWithState and flatMapGroupsWithState in Update mode before joins.

Suggestion: an alternative might be to do the aggregation in a separate stream and save it to a sink, then read from that sink in a new stream and join it with what you wanted; a sketch follows.
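A minimal sketch of that idea, assuming Kafka as the intermediate sink; the broker address, topic name, and checkpoint path are placeholders, not part of the original question:

// Query 1: run the streaming aggregation on its own and persist the
// result to an intermediate, replayable sink (Kafka here).
aggregated                                            // the watermarked aggregation
  .selectExpr("to_json(struct(*)) AS value")
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")   // placeholder
  .option("topic", "aggregated-messages")             // placeholder
  .option("checkpointLocation", "/tmp/chk/agg")       // placeholder
  .outputMode("append")
  .start()

// Query 2: read the aggregated data back as a new stream, so the join
// no longer has a streaming aggregation upstream of it.
val aggStream = ss.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "aggregated-messages")
  .load()
// ...parse the JSON in `value`, then join aggStream with the other
// stream in Append output mode.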
