
How to filter a dataframe based on multiple column matches in other dataframes in Spark Scala

Suppose I have three data frames as follows:

  val df1 = Seq(("steve","Run","Run"),("mike","Swim","Swim"),("bob","Fish","Fish")).toDF("name","sport1","sport2")
  val df2 = Seq(("chris","Bike","Bike"),("dave","Bike","Fish"),("kevin","Run","Run"),("anthony","Fish","Fish"),("liz","Swim","Fish")).toDF("name","sport1","sport2")

(The original post showed a tabular view of these data frames as an image.)

I want to filter df2 down to only the rows whose sport1 and sport2 combination is a valid row in df1. For example, since sport1 -> Run, sport2 -> Run is a valid row in df1, that row would be returned from df2. The row sport1 -> Bike, sport2 -> Bike would not be returned from df2. And the "name" column values should not be considered at all.

The expected result I am looking for is a dataframe with the following data:

+-------+------+------+
|name   |sport1|sport2|
+-------+------+------+
|kevin  |Run   |Run   |
|anthony|Fish  |Fish  |
+-------+------+------+

Thanks, and have a great day!

Try this,

val res = df3.intersect(df1).union(df3.intersect(df2))

+------+------+
|sport1|sport2|
+------+------+
|   Run|   Run|
|  Fish|  Fish|
|  Swim|  Fish|
+------+------+
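The snippet above appears to assume a df3 that holds only the sport columns, which is why name is missing from its output. To make the intended pair-filtering semantics concrete, here is a minimal plain-Scala sketch (no Spark required) over the question's example rows; the variable names mirror the question:

```scala
// Plain Scala collections standing in for the Spark data frames.
val df1 = Seq(("steve", "Run", "Run"), ("mike", "Swim", "Swim"), ("bob", "Fish", "Fish"))
val df2 = Seq(("chris", "Bike", "Bike"), ("dave", "Bike", "Fish"),
              ("kevin", "Run", "Run"), ("anthony", "Fish", "Fish"), ("liz", "Swim", "Fish"))

// Valid (sport1, sport2) pairs are exactly the pairs appearing in df1.
val validPairs = df1.map { case (_, s1, s2) => (s1, s2) }.toSet

// Keep only the df2 rows whose pair is valid; the name column passes through.
val filtered = df2.filter { case (_, s1, s2) => validPairs((s1, s2)) }
// filtered: Seq((kevin,Run,Run), (anthony,Fish,Fish))
```

This is the same logic a semi join performs on a cluster, expressed on local collections.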

To filter a dataframe based on multiple column matches in another dataframe, you can use a join:

df2.join(df1.select("sport1", "sport2"), Seq("sport1", "sport2"))

Since joins are inner by default, you will keep only the rows where "sport1" and "sport2" have the same values in both data frames. Because we pass a list of column names, Seq("sport1", "sport2"), as the join condition, the columns "sport1" and "sport2" appear only once in the result instead of being duplicated.

Using your example's input data:

val df1 = Seq(("steve","Run","Run"),("mike","Swim","Swim"),("bob","Fish","Fish")).toDF("name","sport1","sport2")
val df2 = Seq(("chris","Bike","Bike"),("dave","Bike","Fish"),("kevin","Run","Run"),("anthony","Fish","Fish"),("liz","Swim","Fish")).toDF("name","sport1","sport2")

you get:

+------+------+-------+
|sport1|sport2|name   |
+------+------+-------+
|Run   |Run   |kevin  |
|Fish  |Fish  |anthony|
+------+------+-------+
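As a variant, Spark's `left_semi` join type keeps every df2 row that has a match in df1 and returns only df2's columns, without multiplying rows even if the same (sport1, sport2) pair occurs more than once in df1. A minimal sketch; the local SparkSession setup is only for illustration:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("semiJoinExample").getOrCreate()
import spark.implicits._

val df1 = Seq(("steve", "Run", "Run"), ("mike", "Swim", "Swim"), ("bob", "Fish", "Fish"))
  .toDF("name", "sport1", "sport2")
val df2 = Seq(("chris", "Bike", "Bike"), ("dave", "Bike", "Fish"), ("kevin", "Run", "Run"),
  ("anthony", "Fish", "Fish"), ("liz", "Swim", "Fish")).toDF("name", "sport1", "sport2")

// Semi join: keep df2 rows whose (sport1, sport2) pair exists in df1,
// returning only df2's columns.
val result = df2.join(df1.select("sport1", "sport2"), Seq("sport1", "sport2"), "left_semi")
result.show(false)
```

A semi join is also slightly safer than the inner join above when df1 may contain duplicate (sport1, sport2) pairs, since an inner join would then emit the matching df2 row once per duplicate.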

