
Filter a DF using the column of another DF (same column in both DFs) - Spark Scala

I am trying to filter a DataFrame DF1 using a column of another DataFrame DF2; the column, country_id, exists in both. I want to reduce the first DataFrame to only the rows whose countries appear in the second DF. An example:

+--------------+------------+-------+
|Date          | country_id | value |
+--------------+------------+-------+
|2015-12-14    |ARG         |5      |
|2015-12-14    |GER         |1      |
|2015-12-14    |RUS         |1      |
|2015-12-14    |CHN         |3      |
|2015-12-14    |USA         |1      |
+--------------+------------+-------+


+--------------+------------+
|USE           | country_id |
+--------------+------------+
|  F           |RUS         |
|  F           |CHN         |
+--------------+------------+

Expected:

+--------------+------------+-------+
|Date          | country_id | value |
+--------------+------------+-------+
|2015-12-14    |RUS         |1      |
|2015-12-14    |CHN         |3      |
+--------------+------------+-------+

How could I do this? I am new to Spark, so I thought about using intersect, but would another method be more efficient?

Thanks in advance!

You can use a left semi join:

val DF3 = DF1.join(DF2, Seq("country_id"), "left_semi")

DF3.show

//+----------+----------+-----+
//|country_id|      Date|value|
//+----------+----------+-----+
//|       RUS|2015-12-14|    1|
//|       CHN|2015-12-14|    3|
//+----------+----------+-----+
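
For completeness, here is a minimal, self-contained sketch that reproduces the example end to end. The local[*] master and the app name are assumptions for local testing; in a real job the SparkSession usually already exists:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")           // assumption: local session, just for testing
  .appName("semi-join-filter")  // hypothetical app name
  .getOrCreate()
import spark.implicits._

// Sample data matching the question's tables
val DF1 = Seq(
  ("2015-12-14", "ARG", 5),
  ("2015-12-14", "GER", 1),
  ("2015-12-14", "RUS", 1),
  ("2015-12-14", "CHN", 3),
  ("2015-12-14", "USA", 1)
).toDF("Date", "country_id", "value")

val DF2 = Seq(("F", "RUS"), ("F", "CHN")).toDF("USE", "country_id")

// A left semi join keeps only the DF1 rows whose country_id exists in DF2;
// it never adds DF2's columns and never duplicates DF1 rows
val DF3 = DF1.join(DF2, Seq("country_id"), "left_semi")
DF3.show()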

You can also use an inner join:

val DF3 = DF1.alias("a").join(DF2.alias("b"), Seq("country_id")).select("a.*")
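
Note that the inner join matches the semi join here only if country_id is unique in DF2; duplicate ids in DF2 would duplicate the matching DF1 rows, so for pure filtering the semi join is the safer default. Since the question also mentions intersect: when DF2 is small, another option is to collect its ids to the driver and filter with isin. This is only a sketch of that alternative (ids and DF3b are hypothetical names):

import org.apache.spark.sql.functions.col

// Collect the distinct country_id values of DF2 to the driver
// (only sensible when DF2 is small)
val ids = DF2.select("country_id").distinct().collect().map(_.getString(0))

// Keep the DF1 rows whose country_id is in the collected list
val DF3b = DF1.filter(col("country_id").isin(ids: _*))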
