繁体   English   中英

使用 spark scala 中的元组列表过滤 dataframe

[英]Filter a dataframe using a list of tuples in spark scala

我试图通过比较它的两列(在这种情况下为主题和 stream)来过滤 scala 中的 dataframe 到元组列表。 如果列值和元组值相等,则过滤行。

val df = Seq(
  (0, "Mark", "Maths", "Science"),
  (1, "Tyson", "History", "Commerce"),
  (2, "Gerald", "Maths", "Science"),
  (3, "Katie", "Maths", "Commerce"),
  (4, "Linda", "History", "Science")).toDF("id", "name", "subject", "stream")

样本输入:

+---+------+-------+--------+
| id|  name|subject|  stream|
+---+------+-------+--------+
|  0|  Mark|  Maths| Science|
|  1| Tyson|History|Commerce|
|  2|Gerald|  Maths| Science|
|  3| Katie|  Maths|Commerce|
|  4| Linda|History| Science|
+---+------+-------+--------+

需要过滤上述df的元组列表

  val listOfTuples = List[(String, String)] (
    ("Maths" , "Science"),
    ("History" , "Commerce")
)

预期结果:

+---+------+-------+--------+
| id|  name|subject|  stream|
+---+------+-------+--------+
|  0|  Mark|  Maths| Science|
|  1| Tyson|History|Commerce|
|  2|Gerald|  Maths| Science|
+---+------+-------+--------+

您可以使用带有结构的isin来做到这一点(需要 spark 2.2+):

val df_filtered = df
    .where(struct($"subject",$"stream").isin(listOfTuples.map(typedLit(_)):_*))

或 leftsemi 加入:

val df_filtered = df
.join(listOfTuples.toDF("subject","stream"),Seq("subject","stream"),"leftsemi")

您可以简单地filter

val resultDF = df.filter(row => {
  List(
    ("Maths", "Science"),
    ("History", "Commerce")
  ).contains(
    (row.getAs[String]("subject"), row.getAs[String]("stream")))
})

希望这可以帮助!

暂无
暂无

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM