Spark SQL: keep a non-key row after join

Question

I have two dataset as following:

smoothieDs.show()

|smoothie_id | smoothie     | price | 
|1           | Tropical     | 10    |
|2           | Green vegie  | 20    |

and:

ingredientDs.show()

|smoothie     | ingredient    | 
|Tropical     | Mango         | 
|Tropical     | Passion fruit | 
|Green veggie | Cucumber      |
|Green veggie | Kiwi          |

I want to join two datasets so that I could get ingredient information for each smoothie whose price is lower than 15$, but keep those even if the price is higher, and fill in with a string To be communicated for the ingredient field.

I tried smoothieDs.join(ingredientDs).filter(col(price).lt(15)) and it gives:

|smoothie_id  |price | smoothie     | ingredient    |
|1            |10    | Tropical     | Mango         | 
|1            |10    | Tropical     | Passion fruit |

But my expected result should be:

|smoothie_id  |price | smoothie     | ingredient         |
|1            |10    | Tropical     | Mango              | 
|1            |10    | Tropical     | Passion fruit      | 
|2            |20    | Green veggie | To be communicated |

Is it possible to achieve this using join directly, if not what is the best way to achieve this ?

Answer 1

You can replace the ingredient based on the price after the join:

import org.apache.spark.sql.functions._

smoothieDs.join(ingredientDs, "smoothie")
  .withColumn("ingredient", when('price.lt(15), 'ingredient).otherwise("To be communicated"))
  .distinct()
  .show()

Output:

+------------+-----------+-----+------------------+
|    smoothie|smoothie_id|price|        ingredient|
+------------+-----------+-----+------------------+
|Green veggie|          2|   20|To be communicated|
|    Tropical|          1|   10|             Mango|
|    Tropical|          1|   10|     Passion fruit|
+------------+-----------+-----+------------------+

Edit : another option would be to filter the ingredient dataset first and then do the join. This would avoid using distinct but comes at the price of a second join. Depending on the data this can or can not be faster.

smoothieDs.join(
  ingredientDs.join(smoothieDs.filter('price.lt(15)), Seq("smoothie"), "left_semi"),
  Seq("smoothie"), "left_outer")
  .na.fill("To be communicated", Seq("ingredient"))
  .show()

Answer 2

我不能最终确定您是否可以为此仅使用连接（也许可能，但令人费解？），但在我看来，您可以用价格超过 15 和低于 15 的冰沙将桌子分开，做每边所需的逻辑，然后将表重新组合在一起。

Spark SQL: keep a non-key row after join

Question

1 answers

solution1
1 ACCPTED 2021-07-18 21:35:11

solution2
0 2021-07-18 21:11:42

Spark SQL: keep a non-key row after join

Question

1 answers

solution1 1 ACCPTED 2021-07-18 21:35:11

solution2 0 2021-07-18 21:11:42

solution1
1 ACCPTED 2021-07-18 21:35:11

solution2
0 2021-07-18 21:11:42