How to filter a dataframe based on multiple column matches in other dataframes in Spark Scala

Question

Say I have two dataframes as follows:

  val df1 = Seq(("steve","Run","Run"),("mike","Swim","Swim"),("bob","Fish","Fish")).toDF("name","sport1","sport2")
  val df2 = Seq(("chris","Bike","Bike"),("dave","Bike","Fish"),("kevin","Run","Run"),("anthony","Fish","Fish"),("liz","Swim","Fish")).toDF("name","sport1","sport2")

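Here is a tabular view:

df1:

+-----+------+------+
|name |sport1|sport2|
+-----+------+------+
|steve|Run   |Run   |
|mike |Swim  |Swim  |
|bob  |Fish  |Fish  |
+-----+------+------+

df2:

+-------+------+------+
|name   |sport1|sport2|
+-------+------+------+
|chris  |Bike  |Bike  |
|dave   |Bike  |Fish  |
|kevin  |Run   |Run   |
|anthony|Fish  |Fish  |
|liz    |Swim  |Fish  |
+-------+------+------+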

I want to filter df2 down to only the rows whose sport1 and sport2 combination also appears as a row of df1. For example, since sport1 -> Run, sport2 -> Run is a valid row in df1, the matching row from df2 should be returned. The row sport1 -> Bike, sport2 -> Bike from df2 should not be returned. The value of the 'name' column should not be taken into account at all.

The expected result is a dataframe with the following data:

+-------+------+------+
|name   |sport1|sport2|
+-------+------+------+
|kevin  |Run   |Run   |
|anthony|Fish  |Fish  |
+-------+------+------+

Thanks and have a great day!

Answer 1

Try this:

val res = df3.intersect(df1).union(df3.intersect(df2))

+------+------+
|sport1|sport2|
+------+------+
|   Run|   Run|
|  Fish|  Fish|
|  Swim|  Fish|
+------+------+
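
Applied to the two dataframes shown in the question, an intersect-based version might look like the sketch below. It is only an illustration: intersect compares whole rows, so the 'name' column has to be dropped first, and the result holds the distinct matching (sport1, sport2) pairs rather than the full df2 rows. The session setup and the name validPairs are assumptions made for this example.

```scala
import org.apache.spark.sql.SparkSession

// Local session for this sketch (assumption); in spark-shell a session already exists.
val spark = SparkSession.builder().master("local[*]").appName("intersect-sketch").getOrCreate()
import spark.implicits._

val df1 = Seq(("steve","Run","Run"),("mike","Swim","Swim"),("bob","Fish","Fish")).toDF("name","sport1","sport2")
val df2 = Seq(("chris","Bike","Bike"),("dave","Bike","Fish"),("kevin","Run","Run"),("anthony","Fish","Fish"),("liz","Swim","Fish")).toDF("name","sport1","sport2")

// Keep only the two sport columns on both sides, then take the rows common to both.
// intersect returns distinct rows, so each matching pair appears once.
val validPairs = df2.select("sport1", "sport2").intersect(df1.select("sport1", "sport2"))

validPairs.show(false)
```

Because 'name' is lost here, this form fits best when only the valid combinations are needed, not the original df2 rows.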

Answer 2

To filter a dataframe based on multiple column matches in other dataframes, you can use join:

df2.join(df1.select("sport1", "sport2"), Seq("sport1", "sport2"))

Because join defaults to an inner join, you keep only the rows whose "sport1" and "sport2" values appear in both dataframes. And because the join condition is given as a list of column names, Seq("sport1", "sport2"), the columns "sport1" and "sport2" are not duplicated in the result.

With your example's input data:

val df1 = Seq(("steve","Run","Run"),("mike","Swim","Swim"),("bob","Fish","Fish")).toDF("name","sport1","sport2")
val df2 = Seq(("chris","Bike","Bike"),("dave","Bike","Fish"),("kevin","Run","Run"),("anthony","Fish","Fish"),("liz","Swim","Fish")).toDF("name","sport1","sport2")

You get:

+------+------+-------+
|sport1|sport2|name   |
+------+------+-------+
|Run   |Run   |kevin  |
|Fish  |Fish  |anthony|
+------+------+-------+
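
As a variation on the join above, a left semi join can act as a pure filter. With an inner join, a df2 row would come back once per matching df1 row if df1 contained the same (sport1, sport2) pair more than once; a semi join returns each df2 row at most once and keeps only df2's columns in their original order. The following is a minimal sketch, assuming a local session; the names pairs and filtered are made up for this example.

```scala
import org.apache.spark.sql.SparkSession

// Local session for this sketch (assumption); in spark-shell a session already exists.
val spark = SparkSession.builder().master("local[*]").appName("semi-join-sketch").getOrCreate()
import spark.implicits._

val df1 = Seq(("steve","Run","Run"),("mike","Swim","Swim"),("bob","Fish","Fish")).toDF("name","sport1","sport2")
val df2 = Seq(("chris","Bike","Bike"),("dave","Bike","Fish"),("kevin","Run","Run"),("anthony","Fish","Fish"),("liz","Swim","Fish")).toDF("name","sport1","sport2")

// Only the sport columns of df1 matter for the match; 'name' is ignored.
val pairs = df1.select("sport1", "sport2")

// left_semi keeps the df2 rows that have at least one match in pairs
// and returns df2's columns only (name, sport1, sport2).
val filtered = df2.join(
  pairs,
  df2("sport1") === pairs("sport1") && df2("sport2") === pairs("sport2"),
  "left_semi"
)

filtered.show(false)
```

On the example data this should keep the kevin and anthony rows, matching the expected result in the question.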
