How to filter a dataframe based on multiple column matches in other dataframes in Spark Scala
Suppose I have the following dataframes:
val df1 = Seq(("steve","Run","Run"),("mike","Swim","Swim"),("bob","Fish","Fish")).toDF("name","sport1","sport2")
val df2 = Seq(("chris","Bike","Bike"),("dave","Bike","Fish"),("kevin","Run","Run"),("anthony","Fish","Fish"),("liz","Swim","Fish")).toDF("name","sport1","sport2")
I want to filter df2 down to only the rows whose (sport1, sport2) combination is a valid row in df1. For example, since sport1 -> Run, sport2 -> Run is a valid row in df1, the matching row would be returned from df2. The row with sport1 -> Bike, sport2 -> Bike would not be returned, since that combination never appears in df1. The values in the "name" column should not be considered at all.
The expected result I am looking for is a dataframe with the following data:
+-------+------+------+
|name |sport1|sport2|
+-------+------+------+
|kevin |Run |Run |
|anthony|Fish |Fish |
+-------+------+------+
Thanks, and have a great day!
Try this:
val res = df3.intersect(df1).union(df3.intersect(df2))
+------+------+
|sport1|sport2|
+------+------+
| Run| Run|
| Fish| Fish|
| Swim| Fish|
+------+------+
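The snippet above references a third dataframe df3 that is not defined in this question. For the two-dataframe setup described here, a sketch of the same intersect idea could look like this: select the sport columns from both frames, intersect them to find the valid pairs, then join back to recover the full df2 rows. The df1/df2 definitions come from the question; the session setup and names like validPairs are illustrative.

```scala
import org.apache.spark.sql.SparkSession

// Local session just for this sketch; in spark-shell, `spark` already exists
val spark = SparkSession.builder().master("local[*]").appName("intersect-filter").getOrCreate()
import spark.implicits._

val df1 = Seq(("steve","Run","Run"),("mike","Swim","Swim"),("bob","Fish","Fish"))
  .toDF("name","sport1","sport2")
val df2 = Seq(("chris","Bike","Bike"),("dave","Bike","Fish"),("kevin","Run","Run"),
              ("anthony","Fish","Fish"),("liz","Swim","Fish"))
  .toDF("name","sport1","sport2")

// (sport1, sport2) pairs that occur in both dataframes; intersect also deduplicates
val validPairs = df2.select("sport1", "sport2").intersect(df1.select("sport1", "sport2"))

// Join back to recover the full df2 rows for those pairs
val res = df2.join(validPairs, Seq("sport1", "sport2"))
res.show()
```

Because validPairs is deduplicated before the join back, a df2 row is never multiplied even if the same pair occurs several times in df1.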
To filter a dataframe based on multi-column matches in another dataframe, you can use a join:

df2.join(df1.select("sport1", "sport2"), Seq("sport1", "sport2"))

Since a join is an inner join by default, you will keep only the rows where both dataframes have the same "sport1" and "sport2" values. And because the join condition is given as the column list Seq("sport1", "sport2"), the "sport1" and "sport2" columns are not duplicated in the result.
Using the example's input data:
val df1 = Seq(("steve","Run","Run"),("mike","Swim","Swim"),("bob","Fish","Fish")).toDF("name","sport1","sport2")
val df2 = Seq(("chris","Bike","Bike"),("dave","Bike","Fish"),("kevin","Run","Run"),("anthony","Fish","Fish"),("liz","Swim","Fish")).toDF("name","sport1","sport2")
you get:
+------+------+-------+
|sport1|sport2|name |
+------+------+-------+
|Run |Run |kevin |
|Fish |Fish |anthony|
+------+------+-------+
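As an alternative sketch (not from the answer above): a left_semi join expresses the same filter but returns only df2's columns, in df2's original column order, instead of moving the join columns to the front. The session setup below is illustrative; df1/df2 are the question's dataframes.

```scala
import org.apache.spark.sql.SparkSession

// Local session just for this sketch; in spark-shell, `spark` already exists
val spark = SparkSession.builder().master("local[*]").appName("semi-join-filter").getOrCreate()
import spark.implicits._

val df1 = Seq(("steve","Run","Run"),("mike","Swim","Swim"),("bob","Fish","Fish"))
  .toDF("name","sport1","sport2")
val df2 = Seq(("chris","Bike","Bike"),("dave","Bike","Fish"),("kevin","Run","Run"),
              ("anthony","Fish","Fish"),("liz","Swim","Fish"))
  .toDF("name","sport1","sport2")

// A semi join keeps the df2 rows that have a match in df1
// and returns only df2's columns: name, sport1, sport2
val res = df2.join(df1, Seq("sport1", "sport2"), "left_semi")
res.show()
```

Unlike an inner join, a left_semi join never duplicates a df2 row when the same (sport1, sport2) pair occurs more than once in df1.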
Disclaimer: the technical posts on this site follow the CC BY-SA 4.0 license. If you need to republish, please credit this site or the original source. For any questions, contact: yoyou2525@163.com.