[英]Filter a dataframe using a list of tuples in spark scala
我试图通过比较它的两列(在这种情况下为主题和 stream)来过滤 scala 中的 dataframe 到元组列表。 如果列值和元组值相等,则过滤行。
val df = Seq(
(0, "Mark", "Maths", "Science"),
(1, "Tyson", "History", "Commerce"),
(2, "Gerald", "Maths", "Science"),
(3, "Katie", "Maths", "Commerce"),
(4, "Linda", "History", "Science")).toDF("id", "name", "subject", "stream")
样本输入:
+---+------+-------+--------+
| id| name|subject| stream|
+---+------+-------+--------+
| 0| Mark| Maths| Science|
| 1| Tyson|History|Commerce|
| 2|Gerald| Maths| Science|
| 3| Katie| Maths|Commerce|
| 4| Linda|History| Science|
+---+------+-------+--------+
需要过滤上述df的元组列表
val listOfTuples = List[(String, String)] (
("Maths" , "Science"),
("History" , "Commerce")
)
预期结果:
+---+------+-------+--------+
| id| name|subject| stream|
+---+------+-------+--------+
| 0| Mark| Maths| Science|
| 1| Tyson|History|Commerce|
| 2|Gerald| Maths| Science|
+---+------+-------+--------+
您可以使用带有结构的isin
来做到这一点(需要 spark 2.2+):
val df_filtered = df
.where(struct($"subject",$"stream").isin(listOfTuples.map(typedLit(_)):_*))
或 leftsemi 加入:
val df_filtered = df
.join(listOfTuples.toDF("subject","stream"),Seq("subject","stream"),"leftsemi")
您可以简单地filter
为
val resultDF = df.filter(row => {
List(
("Maths", "Science"),
("History", "Commerce")
).contains(
(row.getAs[String]("subject"), row.getAs[String]("stream")))
})
希望这可以帮助!
声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.