Filter a DataFrame using a list of tuples in Spark (Scala)

Question

I am trying to filter a DataFrame in Scala by comparing two of its columns (subject and stream in this case) against a list of tuples. A row should be kept only if its (subject, stream) pair equals one of the tuples.

val df = Seq(
  (0, "Mark", "Maths", "Science"),
  (1, "Tyson", "History", "Commerce"),
  (2, "Gerald", "Maths", "Science"),
  (3, "Katie", "Maths", "Commerce"),
  (4, "Linda", "History", "Science")).toDF("id", "name", "subject", "stream")

Sample input:

+---+------+-------+--------+
| id|  name|subject|  stream|
+---+------+-------+--------+
|  0|  Mark|  Maths| Science|
|  1| Tyson|History|Commerce|
|  2|Gerald|  Maths| Science|
|  3| Katie|  Maths|Commerce|
|  4| Linda|History| Science|
+---+------+-------+--------+

List of tuples by which the above df needs to be filtered:

val listOfTuples = List[(String, String)](
  ("Maths", "Science"),
  ("History", "Commerce")
)

Expected result:

+---+------+-------+--------+
| id|  name|subject|  stream|
+---+------+-------+--------+
|  0|  Mark|  Maths| Science|
|  1| Tyson|History|Commerce|
|  2|Gerald|  Maths| Science|
+---+------+-------+--------+

Answer 1

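You can do it either with isin on a struct of the two columns (requires Spark 2.2+, where typedLit was introduced):

import org.apache.spark.sql.functions.{struct, typedLit}

val df_filtered = df
  .where(struct($"subject", $"stream").isin(listOfTuples.map(typedLit(_)): _*))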
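If typedLit is not available, or comparing structs feels opaque, the same filter can be written as an OR of per-tuple AND conditions. The following is a minimal sketch and not part of the original answer: the local session setup and the names predicate and filtered are assumptions for illustration, and it is meant to be pasted into spark-shell or run as a script.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, lit}

// Local session for illustration only; in spark-shell this returns the existing session.
val spark = SparkSession.builder().master("local[*]").appName("pair-filter-sketch").getOrCreate()
import spark.implicits._

val df = Seq(
  (0, "Mark", "Maths", "Science"),
  (1, "Tyson", "History", "Commerce"),
  (2, "Gerald", "Maths", "Science"),
  (3, "Katie", "Maths", "Commerce"),
  (4, "Linda", "History", "Science")).toDF("id", "name", "subject", "stream")

val listOfTuples = List(("Maths", "Science"), ("History", "Commerce"))

// One (subject = a AND stream = b) clause per tuple, OR-ed together.
// An empty list falls back to a predicate that matches nothing.
val predicate = listOfTuples
  .map { case (subject, stream) => col("subject") === subject && col("stream") === stream }
  .reduceOption(_ || _)
  .getOrElse(lit(false))

val filtered = df.where(predicate)
filtered.show()
```

The predicate grows with the number of tuples, so this form is probably best kept for short lists.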

or with a leftsemi join, which keeps only the rows of df that have a match in the tuple list:

val df_filtered = df
  .join(listOfTuples.toDF("subject", "stream"), Seq("subject", "stream"), "leftsemi")
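As a fuller sketch of the join approach (an illustration under assumptions, not the answerer's code): the lookup DataFrame is wrapped in a broadcast hint because it is tiny, and a leftanti join is shown as the complement. The names wanted, kept and dropped, and the local session setup, are hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{broadcast, col}

// Local session for illustration only; in spark-shell this returns the existing session.
val spark = SparkSession.builder().master("local[*]").appName("semi-join-sketch").getOrCreate()
import spark.implicits._

val df = Seq(
  (0, "Mark", "Maths", "Science"),
  (1, "Tyson", "History", "Commerce"),
  (2, "Gerald", "Maths", "Science"),
  (3, "Katie", "Maths", "Commerce"),
  (4, "Linda", "History", "Science")).toDF("id", "name", "subject", "stream")

val listOfTuples = List(("Maths", "Science"), ("History", "Commerce"))

// Turn the tuples into a two-column lookup DataFrame.
val wanted = listOfTuples.toDF("subject", "stream")

// leftsemi keeps rows of df with a match in wanted and adds no columns from it.
// The explicit select restores the input column order, since a join on named
// columns may move the join keys to the front.
val kept = df
  .join(broadcast(wanted), Seq("subject", "stream"), "leftsemi")
  .select(df.columns.map(col): _*)

// leftanti is the complement: rows of df with no match in wanted.
val dropped = df
  .join(broadcast(wanted), Seq("subject", "stream"), "leftanti")
  .select(df.columns.map(col): _*)

kept.show()
dropped.show()
```

Unlike isin, the join does not embed the tuples in the query expression, so it tends to scale better when the list is long.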

Answer 2

You can simply filter with a row-level predicate that checks whether the (subject, stream) pair is in the list:

val resultDF = df.filter(row => {
  List(
    ("Maths", "Science"),
    ("History", "Commerce")
  ).contains(
    (row.getAs[String]("subject"), row.getAs[String]("stream")))
})
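A small variation on the snippet above, sketched under assumptions: the pairs are built once as a Set outside the closure instead of a new List per row. The name wantedPairs and the local session setup are hypothetical. Note that a Scala lambda is opaque to Spark's optimizer, so the column-based approaches are usually preferable on large data.

```scala
import org.apache.spark.sql.{Row, SparkSession}

// Local session for illustration only; in spark-shell this returns the existing session.
val spark = SparkSession.builder().master("local[*]").appName("lambda-filter-sketch").getOrCreate()
import spark.implicits._

val df = Seq(
  (0, "Mark", "Maths", "Science"),
  (1, "Tyson", "History", "Commerce"),
  (2, "Gerald", "Maths", "Science"),
  (3, "Katie", "Maths", "Commerce"),
  (4, "Linda", "History", "Science")).toDF("id", "name", "subject", "stream")

val listOfTuples = List(("Maths", "Science"), ("History", "Commerce"))

// Build the lookup once; a Set gives constant-time membership checks.
val wantedPairs: Set[(String, String)] = listOfTuples.toSet

// Keep a row only if its (subject, stream) pair is one of the wanted pairs.
val resultDF = df.filter { (row: Row) =>
  wantedPairs.contains((row.getAs[String]("subject"), row.getAs[String]("stream")))
}

resultDF.show()
```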

Hope this helps!

2 answers

solution1
3 ACCEPTED 2019-09-26 15:52:55

solution2
1 2019-09-26 14:58:13
