
Spark Structured Streaming: joining an aggregate dataframe to a dataframe

I have a streaming dataframe that at some point could look like:

+--------------------+--------------------+
|               owner|              fruits|
+--------------------+--------------------+
|               Brian|               apple|
|               Brian|                pear|
|               Brian|                date|
|               Brian|             avocado|
|                 Bob|             avocado|
|                 Bob|               apple|
|                 ...|                 ...|
+--------------------+--------------------+

I performed a groupBy and an agg with collect_list to clean things up:

import org.apache.spark.sql.functions._
val myFarmDF = farmDF.withWatermark("timeStamp", "1 second").groupBy("owner").agg(collect_list(col("fruits")) as "fruitsA")
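For illustration, with the sample data above, myFarmDF would then look roughly like:

+--------------------+--------------------+
|               owner|             fruitsA|
+--------------------+--------------------+
|               Brian|[apple, pear, dat...|
|                 Bob|    [avocado, apple]|
+--------------------+--------------------+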

The output is a single row for each owner with an array of all of their fruits. I would now like to join this cleaned-up array back to the original streaming dataframe, dropping the fruits column and keeping just the fruitsA column:

val joinedDF = farmDF.join(myFarmDF, "owner").drop("fruits")

This seems to work in my head, but Spark doesn't seem to agree.

I get a

Failure when resolving conflicting references in Join:
'Join Inner
...
+- AnalysisBarrier
      +- Aggregate [name#17], [name#17, collect_list(fruits#61, 0, 0) AS fruitA#142]

When I turn everything into a static dataframe, it works just fine. Is this not possible in a streaming context?

Have you tried renaming the column? There is a similar problem: https://issues.apache.org/jira/browse/SPARK-19860
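For what it's worth, here is a minimal sketch of that rename workaround, assuming the farmDF schema from the question (owner, fruits, timeStamp). Aliasing the grouping key on the aggregated side keeps the two sides of the self-join from sharing the same attribute references; whether the resulting streaming join is supported still depends on your Spark version and output mode.

import org.apache.spark.sql.functions._

// Hypothetical rename workaround: alias the grouping key on the aggregated
// side so the self-join no longer resolves to the same attributes twice.
val aggDF = farmDF
  .withWatermark("timeStamp", "1 second")
  .groupBy(col("owner").as("aggOwner"))
  .agg(collect_list(col("fruits")) as "fruitsA")

val joinedDF = farmDF
  .join(aggDF, farmDF("owner") === aggDF("aggOwner"))
  .drop("fruits", "aggOwner")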
