How to filter a list in Spark with another column of the same dataframe (version 2.2)
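I need to filter a list column using another column of the same dataframe.
Below is my dataframe. Here I want to filter the col3 list against Col1, so that each parent keeps only its active children.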
Df.show(10,false):
=============================
Col1  Col2    col3           flag
P1    Parent  [c1,c2,c3,c4]  Active
c1    Child   []             InActive
c2    Child   []             Active
c3    Child   []             Active
Expected output:
===================
Df.show(10,false):
Col1  Col2    col3     flag
P1    Parent  [c2,c3]  Active
c2    Child   []       Active
c3    Child   []       Active
Can someone help me get the above result?
I generated your dataframe like this:
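// spark-shell pre-imports these; in a standalone application add them explicitly.
import spark.implicits._                  // toDF and the 'col symbol syntax
import org.apache.spark.sql.functions._   // explode, first, collect_list

val df = Seq(("p1", "Parent", Seq("c1", "c2", "c3", "c4"), "Active"),
             ("c1", "Child", Seq(), "Inactive"),
             ("c2", "Child", Seq(), "Active"),
             ("c3", "Child", Seq(), "Active"))
  .toDF("Col1", "Col2", "col3", "flag")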
Then I filtered just the active children into their own dataframe, which is one part of the output:
val active_children = df.where('flag === "Active").where('Col2 === "Child")
I also generated a flattened dataframe of the parent/child relations with explode:
val rels = df.withColumn("child", explode('col3))
  .select("Col1", "Col2", "flag", "child")
scala> rels.show
+----+------+------+-----+
|Col1|  Col2|  flag|child|
+----+------+------+-----+
|  p1|Parent|Active|   c1|
|  p1|Parent|Active|   c2|
|  p1|Parent|Active|   c3|
|  p1|Parent|Active|   c4|
+----+------+------+-----+
and a dataframe with a single column holding the active children, like this:
val child_filter = active_children.select('Col1 as "child")
and used this child_filter dataframe to filter (via a join) the parents you are interested in, then used groupBy to aggregate the rows back into your output format:
val parents = rels
  .join(child_filter, "child")
  .groupBy("Col1")
  .agg(first('Col2) as "Col2",
       collect_list('child) as "col3",
       first('flag) as "flag")
scala> parents.show
+----+------+--------+------+
|Col1|  Col2|    col3|  flag|
+----+------+--------+------+
|  p1|Parent|[c2, c3]|Active|
+----+------+--------+------+
Finally, the union produces the expected output:
scala> parents.union(active_children).show
+----+------+--------+------+
|Col1|  Col2|    col3|  flag|
+----+------+--------+------+
|  p1|Parent|[c2, c3]|Active|
|  c2| Child|      []|Active|
|  c3| Child|      []|Active|
+----+------+--------+------+
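If it helps to sanity-check the logic without a Spark session, here is a rough model of the same steps on plain Scala collections. It is only a sketch and not Spark code: the names FilterModel and Rec and their fields are hypothetical stand-ins for the dataframe columns, and the Set lookup plays the role of the join against child_filter.

```scala
// Plain-Scala model of the filter -> explode/join -> groupBy -> union steps.
// FilterModel, Rec and the field names are hypothetical, chosen for illustration.
object FilterModel {
  final case class Rec(col1: String, col2: String, col3: Seq[String], flag: String)

  val rows: Seq[Rec] = Seq(
    Rec("p1", "Parent", Seq("c1", "c2", "c3", "c4"), "Active"),
    Rec("c1", "Child", Seq.empty, "Inactive"),
    Rec("c2", "Child", Seq.empty, "Active"),
    Rec("c3", "Child", Seq.empty, "Active")
  )

  // Mirrors active_children: rows that are children and flagged Active.
  val activeChildren: Seq[Rec] =
    rows.filter(r => r.flag == "Active" && r.col2 == "Child")

  // Mirrors child_filter: the ids of the active children.
  val activeIds: Set[String] = activeChildren.map(_.col1).toSet

  // Mirrors explode + join + groupBy: rows with an empty col3 vanish
  // (as explode drops them), and rows whose children are all inactive
  // vanish too (as the inner join drops them).
  val parents: Seq[Rec] = rows
    .filter(_.col3.nonEmpty)
    .map(r => r.copy(col3 = r.col3.filter(activeIds.contains)))
    .filter(_.col3.nonEmpty)

  // Mirrors the union: filtered parents followed by the active children.
  val result: Seq[Rec] = parents ++ activeChildren
}
```

FilterModel.result holds p1 with col3 reduced to c2 and c3, followed by the c2 and c3 rows. Two things become visible here that also hold for the dataframe version: a parent with no active children disappears from the output entirely, and this model keeps the children in their original list order, whereas collect_list after a join makes no ordering promise as far as I know.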