
How to filter a dataframe based on multiple column matches in other dataframes in Spark Scala

Say I have the following dataframes:

  val df1 = Seq(("steve","Run","Run"),("mike","Swim","Swim"),("bob","Fish","Fish")).toDF("name","sport1","sport2")
  val df2 = Seq(("chris","Bike","Bike"),("dave","Bike","Fish"),("kevin","Run","Run"),("anthony","Fish","Fish"),("liz","Swim","Fish")).toDF("name","sport1","sport2")

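Here is a tabular view:

df1:

+-----+------+------+
|name |sport1|sport2|
+-----+------+------+
|steve|Run   |Run   |
|mike |Swim  |Swim  |
|bob  |Fish  |Fish  |
+-----+------+------+

df2:

+-------+------+------+
|name   |sport1|sport2|
+-------+------+------+
|chris  |Bike  |Bike  |
|dave   |Bike  |Fish  |
|kevin  |Run   |Run   |
|anthony|Fish  |Fish  |
|liz    |Swim  |Fish  |
+-------+------+------+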

I want to filter df2 down to only the rows whose sport1 and sport2 combination appears as a row in df1. For example, since sport1 -> Run, sport2 -> Run is a valid row in df1, the matching row from df2 should be returned. The row sport1 -> Bike, sport2 -> Bike from df2 should not be returned. The value of the 'name' column should not be considered at all.

The expected result I'm looking for is a dataframe with the following data:

+-------+------+------+
|name   |sport1|sport2|
+-------+------+------+
|kevin  |Run   |Run   |
|anthony|Fish  |Fish  |
+-------+------+------+

Thanks and have a great day!

Try this:

val res = df3.intersect(df1).union(df3.intersect(df2))

+------+------+
|sport1|sport2|
+------+------+
|   Run|   Run|
|  Fish|  Fish|
|  Swim|  Fish|
+------+------+

To filter a dataframe based on multiple column matches in another dataframe, you can use join:

df2.join(df1.select("sport1", "sport2"), Seq("sport1", "sport2"))

Since join performs an inner join by default, only the rows of df2 whose "sport1" and "sport2" values both match a row of df1 are kept. And because the join condition is given as a list of column names, Seq("sport1", "sport2"), the columns "sport1" and "sport2" are not duplicated in the result.

With your example's input data:

val df1 = Seq(("steve","Run","Run"),("mike","Swim","Swim"),("bob","Fish","Fish")).toDF("name","sport1","sport2")
val df2 = Seq(("chris","Bike","Bike"),("dave","Bike","Fish"),("kevin","Run","Run"),("anthony","Fish","Fish"),("liz","Swim","Fish")).toDF("name","sport1","sport2")

You get:

+------+------+-------+
|sport1|sport2|name   |
+------+------+-------+
|Run   |Run   |kevin  |
|Fish  |Fish  |anthony|
+------+------+-------+
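If you would rather keep df2's original column order (name, sport1, sport2), or df1 might contain the same sport1/sport2 pair more than once, a left semi join may be worth a try. With an inner join, a duplicated pair in df1 would repeat the matching df2 rows; a semi join only checks that a match exists. Below is a minimal sketch of that variant, not the only way to do it. The object name SemiJoinFilter, the helper keepMatchingPairs and the local SparkSession settings are made up for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical wrapper object and helper, named only for this sketch.
object SemiJoinFilter {

  // Keep the rows of `target` whose (sport1, sport2) pair exists in `valid`.
  // "left_semi" returns only the columns of `target`, in their original order,
  // and each row of `target` at most once, even if `valid` repeats a pair.
  def keepMatchingPairs(target: DataFrame, valid: DataFrame): DataFrame =
    target.join(
      valid,
      target("sport1") === valid("sport1") && target("sport2") === valid("sport2"),
      "left_semi"
    )

  def main(args: Array[String]): Unit = {
    // Assumption: a local session is enough for trying this out.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("semi-join-filter")
      .getOrCreate()
    import spark.implicits._

    val df1 = Seq(("steve","Run","Run"),("mike","Swim","Swim"),("bob","Fish","Fish"))
      .toDF("name","sport1","sport2")
    val df2 = Seq(("chris","Bike","Bike"),("dave","Bike","Fish"),("kevin","Run","Run"),
      ("anthony","Fish","Fish"),("liz","Swim","Fish"))
      .toDF("name","sport1","sport2")

    // Expected to contain the kevin and anthony rows; row order is not guaranteed.
    keepMatchingPairs(df2, df1).show(false)

    spark.stop()
  }
}
```

Note that the join condition uses explicit column expressions here rather than Seq("sport1", "sport2"), so the result keeps df2's column layout instead of moving the join keys to the front.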
