Filter DF using the column of another DF (same col in both DF) Spark Scala

Question

I am trying to filter a DataFrame DF1 using a column of another DataFrame DF2; the column is country_id. I want to reduce the rows of the first DataFrame to only the countries that are present in the second DataFrame. An example:

+--------------+------------+-------+
|Date          | country_id | value |
+--------------+------------+-------+
|2015-12-14    |ARG         |5      |
|2015-12-14    |GER         |1      |
|2015-12-14    |RUS         |1      |
|2015-12-14    |CHN         |3      |
|2015-12-14    |USA         |1      |


+--------------+------------+
|USE           | country_id |
+--------------+------------+
|  F           |RUS         |
|  F           |CHN         |

Expected:

+--------------+------------+-------+
|Date          | country_id | value |
+--------------+------------+-------+
|2015-12-14    |RUS         |1      |
|2015-12-14    |CHN         |3      |

How could I do this? I am new to Spark, so I thought of maybe using intersect. Or would another method be more efficient?

Thanks in advance!

Answer 1

You can use a left semi join, which keeps only the DF1 rows whose country_id appears in DF2 and returns only DF1's columns:

val DF3 = DF1.join(DF2, Seq("country_id"), "left_semi")

DF3.show

//+----------+----------+-----+
//|country_id|      Date|value|
//+----------+----------+-----+
//|       RUS|2015-12-14|    1|
//|       CHN|2015-12-14|    3|
//+----------+----------+-----+
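For anyone who wants to reproduce this, here is a minimal sketch that rebuilds the two DataFrames from the question and applies the same left semi join. The local SparkSession settings and the helper name keepMatchingCountries are assumptions made for illustration, not part of the original code; it is meant to be pasted into spark-shell or run as a script with Spark on the classpath.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical helper: keep only the df1 rows whose country_id exists in df2.
def keepMatchingCountries(df1: DataFrame, df2: DataFrame): DataFrame =
  df1.join(df2, Seq("country_id"), "left_semi")

// Assumed local session for experimenting; in spark-shell this reuses the existing one.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("left-semi-sketch")
  .getOrCreate()
import spark.implicits._

// Sample data copied from the question.
val DF1 = Seq(
  ("2015-12-14", "ARG", 5),
  ("2015-12-14", "GER", 1),
  ("2015-12-14", "RUS", 1),
  ("2015-12-14", "CHN", 3),
  ("2015-12-14", "USA", 1)
).toDF("Date", "country_id", "value")

val DF2 = Seq(("F", "RUS"), ("F", "CHN")).toDF("USE", "country_id")

val DF3 = keepMatchingCountries(DF1, DF2)
DF3.show()
```

Because a semi join only tests for the existence of a match, repeated country_id values in DF2 do not multiply the rows of DF1.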

You can also use an inner join, but note that it will repeat DF1 rows if DF2 contains the same country_id more than once:

val DF3 = DF1.alias("a").join(DF2.alias("b"), Seq("country_id")).select("a.*")
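If you prefer the inner join, one way to avoid repeated rows is to join against the distinct keys of DF2 only. The sketch below does that and adds a broadcast hint, on the assumption that the list of countries is small enough to fit in memory on each executor. The helper name filterByDistinctKeys, the local session, and the sample data with a repeated RUS row are made up for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.broadcast

// Hypothetical helper: inner join against the de-duplicated key column only,
// so the result contains just df1's columns and no repeated rows.
def filterByDistinctKeys(df1: DataFrame, df2: DataFrame): DataFrame = {
  val keys = df2.select("country_id").distinct()
  // broadcast is only a hint; it assumes the key set is small.
  df1.join(broadcast(keys), Seq("country_id"))
}

// Assumed local session for experimenting.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("inner-join-sketch")
  .getOrCreate()
import spark.implicits._

val left = Seq(
  ("2015-12-14", "ARG", 5),
  ("2015-12-14", "RUS", 1),
  ("2015-12-14", "CHN", 3)
).toDF("Date", "country_id", "value")

// RUS appears twice on purpose to show the effect of distinct().
val right = Seq(("F", "RUS"), ("T", "RUS"), ("F", "CHN")).toDF("USE", "country_id")

val filtered = filterByDistinctKeys(left, right)
filtered.show() // two rows (RUS and CHN); a plain inner join on right would give three
```

As for intersect: it compares whole rows and needs both DataFrames to have the same columns, so it does not fit this case directly, where DF1 and DF2 have different schemas.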
