
Spark SQL: keep a non-key row after join

I have two datasets as follows:

smoothieDs.show()

|smoothie_id | smoothie     | price |
|1           | Tropical     | 10    |
|2           | Green veggie | 20    |

and:

ingredientDs.show()

|smoothie     | ingredient    | 
|Tropical     | Mango         | 
|Tropical     | Passion fruit | 
|Green veggie | Cucumber      |
|Green veggie | Kiwi          |

I want to join the two datasets so that I get the ingredient information for each smoothie whose price is lower than $15, but also keep the smoothies whose price is higher, with the ingredient field filled in with the string "To be communicated".

I tried smoothieDs.join(ingredientDs, "smoothie").filter(col("price").lt(15)) and it gives:

|smoothie_id  |price | smoothie     | ingredient    |
|1            |10    | Tropical     | Mango         | 
|1            |10    | Tropical     | Passion fruit | 

But my expected result should be:

|smoothie_id  |price | smoothie     | ingredient         |
|1            |10    | Tropical     | Mango              | 
|1            |10    | Tropical     | Passion fruit      | 
|2            |20    | Green veggie | To be communicated | 

Is it possible to achieve this using a join directly? If not, what is the best way to achieve it?

You can replace the ingredient based on the price after the join:

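import org.apache.spark.sql.functions._
import spark.implicits._  // enables the 'price symbol syntax; assumes a SparkSession named spark

smoothieDs.join(ingredientDs, "smoothie")
  // keep the real ingredient only for smoothies cheaper than 15
  .withColumn("ingredient", when('price.lt(15), 'ingredient).otherwise("To be communicated"))
  // each expensive smoothie now has one identical row per ingredient; collapse them
  .distinct()
  .show()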

Output:

+------------+-----------+-----+------------------+
|    smoothie|smoothie_id|price|        ingredient|
+------------+-----------+-----+------------------+
|Green veggie|          2|   20|To be communicated|
|    Tropical|          1|   10|             Mango|
|    Tropical|          1|   10|     Passion fruit|
+------------+-----------+-----+------------------+

Edit: another option is to filter the ingredient dataset first and then do the join. This avoids distinct but comes at the price of a second join. Depending on the data, this may or may not be faster.

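smoothieDs.join(
  // left_semi keeps only the ingredients of smoothies cheaper than 15
  ingredientDs.join(smoothieDs.filter('price.lt(15)), Seq("smoothie"), "left_semi"),
  Seq("smoothie"), "left_outer")
  // smoothies left without a match get a null ingredient; replace it with the placeholder
  .na.fill("To be communicated", Seq("ingredient"))
  .show()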

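I can't say for sure whether you can do this with a join alone (it may be possible, but convoluted?). As I see it, though, you could split the table into smoothies priced above 15 and smoothies priced below 15, apply the required logic to each side, and then combine the two tables again.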
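
The split-and-recombine idea could look roughly like the sketch below. It is only an illustration under some assumptions: the object SplitAndUnion, the helper withIngredients and its threshold parameter are made-up names, the sample data is copied from the question, and unionByName requires Spark 2.3 or later.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, lit}

object SplitAndUnion {

  // Hypothetical helper (not from the original posts): handle cheap and
  // expensive smoothies separately, then union the two halves.
  def withIngredients(smoothies: DataFrame,
                      ingredients: DataFrame,
                      threshold: Int = 15): DataFrame = {
    // Cheap side: attach the real ingredients with an inner join.
    val cheap = smoothies
      .filter(col("price") < threshold)
      .join(ingredients, Seq("smoothie"), "inner")

    // Expensive side: no join at all, just add the placeholder column.
    val expensive = smoothies
      .filter(col("price") >= threshold)
      .withColumn("ingredient", lit("To be communicated"))

    // The two sides have different column orders, so match columns by name.
    cheap.unionByName(expensive)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("smoothies")
      .getOrCreate()
    import spark.implicits._

    // Sample data taken from the question.
    val smoothieDs = Seq((1, "Tropical", 10), (2, "Green veggie", 20))
      .toDF("smoothie_id", "smoothie", "price")
    val ingredientDs = Seq(
      ("Tropical", "Mango"),
      ("Tropical", "Passion fruit"),
      ("Green veggie", "Cucumber"),
      ("Green veggie", "Kiwi")
    ).toDF("smoothie", "ingredient")

    withIngredients(smoothieDs, ingredientDs).show()
    spark.stop()
  }
}
```

This needs neither distinct nor a second join, but it scans the smoothie dataset twice. Note that rows with a null price match neither filter and are dropped, and a cheap smoothie with no ingredient rows disappears because of the inner join; a left_outer join on the cheap side would keep it.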
