How to filter a dataframe based on multiple column matches in other dataframes in Spark Scala
Say I have the following dataframes:
val df1 = Seq(("steve","Run","Run"),("mike","Swim","Swim"),("bob","Fish","Fish")).toDF("name","sport1","sport2")
val df2 = Seq(("chris","Bike","Bike"),("dave","Bike","Fish"),("kevin","Run","Run"),("anthony","Fish","Fish"),("liz","Swim","Fish")).toDF("name","sport1","sport2")
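Here is a tabular view:
df1:
+-----+------+------+
|name |sport1|sport2|
+-----+------+------+
|steve|Run   |Run   |
|mike |Swim  |Swim  |
|bob  |Fish  |Fish  |
+-----+------+------+
df2:
+-------+------+------+
|name   |sport1|sport2|
+-------+------+------+
|chris  |Bike  |Bike  |
|dave   |Bike  |Fish  |
|kevin  |Run   |Run   |
|anthony|Fish  |Fish  |
|liz    |Swim  |Fish  |
+-------+------+------+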
I want to filter df2 down to only the rows whose (sport1, sport2) combination appears as a row in df1. For example, since sport1 -> Run, sport2 -> Run is a valid row in df1, the matching row from df2 should be returned. The row with sport1 -> Bike, sport2 -> Bike from df2 should not be returned, because that combination does not occur in df1. The value of the 'name' column should not be taken into account at all.
The expected result I'm looking for is a dataframe with the following data:
+-------+------+------+
|name   |sport1|sport2|
+-------+------+------+
|kevin  |Run   |Run   |
|anthony|Fish  |Fish  |
+-------+------+------+
Thanks and have a great day!
Try this:
val res = df3.intersect(df1).union(df3.intersect(df2))
+------+------+
|sport1|sport2|
+------+------+
|   Run|   Run|
|  Fish|  Fish|
|  Swim|  Fish|
+------+------+
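The snippet above uses a df3 that is not defined in the question. As a rough sketch of the same intersect idea using only the question's df1 and df2, one could intersect just the two sport columns. The local SparkSession setup and the name validPairs are assumptions added for illustration. Note that intersect compares whole rows, so the name column has to be dropped first and the result contains only the distinct matching pairs, not the names.

```scala
import org.apache.spark.sql.SparkSession

// Local session for illustration only; in spark-shell, `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("intersect-sketch").getOrCreate()
import spark.implicits._

val df1 = Seq(("steve","Run","Run"),("mike","Swim","Swim"),("bob","Fish","Fish")).toDF("name","sport1","sport2")
val df2 = Seq(("chris","Bike","Bike"),("dave","Bike","Fish"),("kevin","Run","Run"),("anthony","Fish","Fish"),("liz","Swim","Fish")).toDF("name","sport1","sport2")

// Keep only the (sport1, sport2) pairs that occur in both dataframes.
// `validPairs` is a hypothetical name; the `name` column is not part of the result.
val validPairs = df2.select("sport1", "sport2")
  .intersect(df1.select("sport1", "sport2"))

validPairs.show()
```

To get the names back, these pairs would still have to be joined against df2, which is why a join-based approach is more direct for this question.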
To filter a dataframe based on multiple column matches in other dataframes, you can use join:
df2.join(df1.select("sport1", "sport2"), Seq("sport1", "sport2"))
Since join performs an inner join by default, you keep only the rows whose "sport1" and "sport2" values match in both dataframes. And because the join condition is given as a list of columns, Seq("sport1", "sport2"), the columns "sport1" and "sport2" are not duplicated in the result.
With your example's input data:
val df1 = Seq(("steve","Run","Run"),("mike","Swim","Swim"),("bob","Fish","Fish")).toDF("name","sport1","sport2")
val df2 = Seq(("chris","Bike","Bike"),("dave","Bike","Fish"),("kevin","Run","Run"),("anthony","Fish","Fish"),("liz","Swim","Fish")).toDF("name","sport1","sport2")
You get:
+------+------+-------+
|sport1|sport2|name   |
+------+------+-------+
|Run   |Run   |kevin  |
|Fish  |Fish  |anthony|
+------+------+-------+
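Two things to be aware of with the inner join: the join columns come first in the output, and if df1 contained the same (sport1, sport2) pair more than once, each matching df2 row would be repeated. A possible variant, sketched here rather than taken from the answer above, is a left semi join with an explicit join expression, which acts as a pure filter on df2. The local SparkSession setup and the name filtered are assumptions for illustration.

```scala
import org.apache.spark.sql.SparkSession

// Local session for illustration only; in spark-shell, `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("semi-join-sketch").getOrCreate()
import spark.implicits._

val df1 = Seq(("steve","Run","Run"),("mike","Swim","Swim"),("bob","Fish","Fish")).toDF("name","sport1","sport2")
val df2 = Seq(("chris","Bike","Bike"),("dave","Bike","Fish"),("kevin","Run","Run"),("anthony","Fish","Fish"),("liz","Swim","Fish")).toDF("name","sport1","sport2")

// A left semi join returns only df2's columns, in df2's order, and never
// multiplies rows, even if df1 holds the same sport pair several times.
// `filtered` is a hypothetical name.
val filtered = df2.join(
  df1,
  df2("sport1") === df1("sport1") && df2("sport2") === df1("sport2"),
  "left_semi"
)

filtered.show()
```

With the example data this should keep the kevin and anthony rows, with the columns in the order name, sport1, sport2 as in the expected result.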