Filtering a DataFrame by a list of tuples in Spark Scala

Question

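I am trying to filter a DataFrame in Scala by comparing two of its columns (subject and stream in this case) against a list of tuples. A row should be kept only if its (subject, stream) pair equals one of the tuples in the list.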

val df = Seq(
  (0, "Mark", "Maths", "Science"),
  (1, "Tyson", "History", "Commerce"),
  (2, "Gerald", "Maths", "Science"),
  (3, "Katie", "Maths", "Commerce"),
  (4, "Linda", "History", "Science")).toDF("id", "name", "subject", "stream")

Sample input:

+---+------+-------+--------+
| id|  name|subject|  stream|
+---+------+-------+--------+
|  0|  Mark|  Maths| Science|
|  1| Tyson|History|Commerce|
|  2|Gerald|  Maths| Science|
|  3| Katie|  Maths|Commerce|
|  4| Linda|History| Science|
+---+------+-------+--------+

The list of tuples to filter the above df by:

val listOfTuples = List[(String, String)](
  ("Maths", "Science"),
  ("History", "Commerce")
)

Expected result:

+---+------+-------+--------+
| id|  name|subject|  stream|
+---+------+-------+--------+
|  0|  Mark|  Maths| Science|
|  1| Tyson|History|Commerce|
|  2|Gerald|  Maths| Science|
+---+------+-------+--------+

Answer 1

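You can do this with isin on a struct (requires Spark 2.2+):

import org.apache.spark.sql.functions.{struct, typedLit}  // $"..." also needs spark.implicits._

val df_filtered = df
  .where(struct($"subject", $"stream").isin(listOfTuples.map(typedLit(_)): _*))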

Or with a leftsemi join:

val df_filtered = df
  .join(listOfTuples.toDF("subject", "stream"), Seq("subject", "stream"), "leftsemi")
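For anyone who wants to try both variants from this answer side by side, here is a minimal sketch that should run as a script or be pasted into spark-shell. The local master, the app name, and the names viaIsin and viaJoin are my own assumptions, not part of the answer. The join variant also re-selects the original columns so that both results share one column order.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, struct, typedLit}

// Local session for experimentation only; master and appName are assumptions.
// In spark-shell, getOrCreate() returns the session that already exists.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("filter-by-tuples")
  .getOrCreate()
import spark.implicits._

// Same data as in the question.
val df = Seq(
  (0, "Mark", "Maths", "Science"),
  (1, "Tyson", "History", "Commerce"),
  (2, "Gerald", "Maths", "Science"),
  (3, "Katie", "Maths", "Commerce"),
  (4, "Linda", "History", "Science")).toDF("id", "name", "subject", "stream")

val listOfTuples = List(("Maths", "Science"), ("History", "Commerce"))

// Variant 1: pack both columns into a struct and test membership
// against one struct literal per tuple.
val viaIsin = df.where(
  struct($"subject", $"stream").isin(listOfTuples.map(typedLit(_)): _*))

// Variant 2: a left semi join keeps the rows of df that have a match
// in the tuple DataFrame and adds no columns from the right side.
val viaJoin = df
  .join(listOfTuples.toDF("subject", "stream"), Seq("subject", "stream"), "leftsemi")
  .select(df.columns.map(col): _*) // keep the original column order

viaIsin.show()
viaJoin.show()
```

The join does not promise any row order, so add an orderBy("id") if the output order matters.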

Answer 2

You can simply use filter:

val resultDF = df.filter(row => {
  List(
    ("Maths", "Science"),
    ("History", "Commerce")
  ).contains(
    (row.getAs[String]("subject"), row.getAs[String]("stream")))
})
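The lambda above builds its List again on every call. A small variation, sketched here under my own assumptions (the helper name keepPairs, the value name wanted, and the local session settings are hypothetical), builds the pairs once as a Set and passes them in. Note that a Scala lambda is opaque to Spark's optimizer, unlike a Column expression, so this style trades some optimization for plain-Scala readability.

```scala
import org.apache.spark.sql.{DataFrame, Row, SparkSession}

// Local session for experimentation only; master and appName are assumptions.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("filter-by-tuples-lambda")
  .getOrCreate()
import spark.implicits._

// Same data as in the question.
val df = Seq(
  (0, "Mark", "Maths", "Science"),
  (1, "Tyson", "History", "Commerce"),
  (2, "Gerald", "Maths", "Science"),
  (3, "Katie", "Maths", "Commerce"),
  (4, "Linda", "History", "Science")).toDF("id", "name", "subject", "stream")

// Built once, outside the lambda.
val wanted: Set[(String, String)] =
  Set(("Maths", "Science"), ("History", "Commerce"))

// keepPairs is a hypothetical helper: it keeps the rows whose
// (subject, stream) pair is contained in the given set.
def keepPairs(data: DataFrame, pairs: Set[(String, String)]): DataFrame =
  data.filter((row: Row) =>
    pairs.contains((row.getAs[String]("subject"), row.getAs[String]("stream"))))

val resultDF = keepPairs(df, wanted)
resultDF.show()
```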

Hope this helps!

Answer 1: 3, accepted, 2019-09-26 15:52:55

Answer 2: 1, 2019-09-26 14:58:13
